eclairs.png

An Application of Recommender Systems on Dessert Ecommerce

Executive Summary

If you're a foodie like LT7 members, you know the struggle of deciding what to cook for your next meal. With thousands of recipes available online, it can be overwhelming to choose just one. That's where a personalized recommendation system comes in handy. And that's precisely what this project aims to achieve for dessert lovers on food.com!

The pandemic has led to a surge in home cooking, and food.com has become a go-to destination for many recipe seekers. However, the platform lacks a recommendation system to help users navigate through the vast collection of recipes available. This project addresses that gap by creating a recommender system based on user preferences.

Using a Kaggle dataset of scraped recipes and user interactions from food.com, the team implemented a comprehensive data science pipeline to generate personalized recommendations for clustered users and items. The dataset was pre-processed to remove duplicates and filter out unnecessary data, and dimensionality reduction techniques were applied to manage computational efficiency and memory constraints. The team used clustering techniques to identify user and item clusters, and various collaborative filtering methods were used to generate recommendations based on user ratings.

The study found that the most effective approach was latent-factor-based collaborative filtering, which provided personalized recommendations with high coverage, a balanced novelty score, and a high intra-list similarity score. This means that users can get recommendations tailored to their preferences while still exploring new options. The study also recommends further improvements: applying other clustering techniques and algorithms, content-based collaborative filtering, and exploring other food categories.

With this project, food.com users can enjoy a personalized recommendation system that improves their experience and engagement on the platform. Whether you're a dessert enthusiast or a curious recipe seeker, this recommender system will help you find your next favorite dessert recipe in no time!

Background

The recent pandemic has led to a renewed interest in home cooking. With the closure of food establishments, people were left with little choice but to prepare their meals. According to a survey conducted by Hunter [1], a food and beverage marketing agency, 54% of Americans said they were cooking more at home since the pandemic started, 67% had increased confidence in their cooking abilities, and 48% said they were trying new recipes. A similar survey conducted by the International Food Information Council [2] reflected similar results, with 60% of respondents saying they are cooking at home more often.

Food recipe websites played a crucial role in this trend. They provided quick access to a wide selection of recipes, allowing users to search for recipes based on the available ingredients and cooking materials they had. In addition, with concerns about health and wellness during the pandemic, many people turned to these websites for ideas on how to prepare healthy meals. Food websites also offered resources on healthy eating, including tips on portion control and developing meal plans. The more popular websites also include forums and comment sections where users can discuss with one another. These gave home cooks a sense of community during a time of extended isolation.

Among the most popular food websites is food.com [3]. Founded in 2004, food.com is a collaborative platform where home cooks can share their favorite recipes and culinary creations. It has one of the largest and still growing collections of recipes online, rivaled only by other big food websites such as allrecipes.com and foodnetwork.com. It has one of the most active communities of users who rate and review recipes that they have tried, as well as share cooking tips and tricks with beginners. More recently, food.com introduced a meal planning tool and a shopping list tool wherein users can choose recipes from the collection and these tools will return a schedule of dishes for each day and a list of ingredients to buy.

One of the shortcomings of food.com is its lack of a personalized recommendation system. Such a system helps users on streaming websites like YouTube and Netflix to decide on what to watch next, with the latter going as far as offering a $1 million prize for developers who can beat its algorithm back in 2006 [4]. Recommender systems are commonly used by e-commerce platforms such as Amazon and Shopee to suggest products to their users based on their browsing and purchase history. This helps to increase customer satisfaction, reduce search costs, and improve sales.

Problem Statement

problem_statement.png

Figure 1. Overview of the Problem

Motivation

Currently, food.com has a fixed set of recommendations based on the current recipe that a user is viewing. With the thousands of recipes available on the platform, users will benefit from a system that recommends recipes based on the previous recipes that a user has rated well. For beginners, recommending similar recipes will help them achieve mastery of a certain type of dish instead of doing a mediocre job on a wide variety of dishes. This is especially true for desserts since they are known to have a higher learning curve compared to other types of dishes. Recommending the correct recipes makes them more likely to continue cooking and in turn, continue visiting the website. Recommended recipes with similar ingredients also help in managing stock and reducing spoilage. More importantly, such systems help in increasing website traffic and user engagement by minimizing the user's search fatigue—showing users what they want to see with minimal effort on their part. The less time users spend on searching, the more time they can allot to writing reviews, posting comments in the forums, and trying out the website's lesser-known features.

The choice of focus is intentional. Recommender systems work best when there is high variability in the recommendable items in the system. This makes it more challenging to create such systems in niche-specific applications. The team wanted to modify the usual algorithms in recommender systems to include clustering techniques. The effect of the modification will not be apparent if the data already performs well for vanilla implementations.

A recommender system for food recipes works the same way as a recommender system for food orders, so this work extends naturally to dessert e-commerce shops. The team identified two local shops that would benefit greatly from a recommender system: Kukido [5] and Lacher Patisserie [6]. Both stores have an e-commerce platform, but they do not provide customized product recommendations to their users. Instead, they only display their monthly best sellers. This is a missed opportunity for returning customers who have a hard time deciding what else to buy. Having personalized recommendations will improve user experience and increase the companies' sales.

Data Source

data_source.png

Figure 2. Overview of the Data Source

The data comes from a Kaggle dataset of recipes and user interactions scraped from food.com. The dataset is also available in the jojie public dataset repository.

Filepath: /mnt/data/public/food-com-recipes

Dataset Summary:

  • 180K+ recipes
  • 700k+ recipe reviews
  • 18+ years of user interactions

Original Data Source: This dataset consists of 180K+ recipes and 700K+ recipe reviews covering 18 years of user interactions and uploads on Food.com (formerly GeniusKitchen). It was used in the following paper:

Generating Personalized Recipes from Historical User Preferences
Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, Julian McAuley
EMNLP, 2019
https://www.aclweb.org/anthology/D19-1613/

The dataset can be downloaded from Kaggle: https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions

Content: This dataset contains three sets of data from Food.com:

Interaction Splits:

  • interactions_test.csv
  • interactions_validation.csv
  • interactions_train.csv

Preprocessed data for result reproduction:

In this format, the recipe text metadata is tokenized via the GPT subword tokenizer, with special tokens such as start-of-step markers.

  • PP_recipes.csv
  • PP_users.csv

Raw data:

  • RAW_interactions.csv
  • RAW_recipes.csv

PP_Recipes Dataset

  • This dataset contains the preprocessed data for recipes.

PP_users Dataset

  • This dataset contains the preprocessed data for user information.

RAW_interactions Dataset

  • This dataset contains the raw data for interactions.

RAW_recipes Dataset

  • This dataset contains the raw data for recipes.

interactions_test Dataset

  • This dataset contains the test split for recipe interactions ("reviews").

interactions_train Dataset

  • This dataset contains the training split for recipe interactions ("reviews").

interactions_validation Dataset

  • This dataset contains the validation split for recipe interactions ("reviews").


Data Exploration

Raw Data Description

Table 1. PP_Recipes Raw Data Description

Rows: 178,265
Columns: 8
No Null Values
Variable Name | Data type | Variable Category | Description
id | integer | nominal | Recipe ID
i | integer | nominal | Recipe ID mapped to contiguous integers from 0
name_tokens | object | list(string) | BPE-tokenized recipe name
ingredient_tokens | object | list(string) | BPE-tokenized ingredients list (list of lists)
steps_tokens | integer | list(string) | BPE-tokenized steps
techniques | integer | list(string) | List of techniques used in recipe
calorie_level | integer | categorical | Calorie level in ascending order
ingredient_ids | object | list(string) | IDs of ingredients in recipe
id i calorie_level
count 178265 178265 178265
mean 213462 89132 0.863192
std 138267 51460.8 0.791486
min 38 0 0
25% 94576 44566 0
50% 196312 89132 1
75% 320562 133698 2
max 537716 178264 2

Table 2. PP_Users Raw Data Description

Rows: 25,076
Columns: 6
No Null Values
Variable Name | Data type | Variable Category | Description
u | integer | ordinal | User ID mapped to contiguous integer sequence from 0
techniques | object | list(string) | Cooking techniques encountered by user
items | object | list(string) | Recipes interacted with, in order
n_items | integer | nominal | Number of recipes reviewed
ratings | object | list(string) | Ratings given to each recipe encountered by this user
n_ratings | integer | nominal | Number of ratings in total
u n_items n_ratings
count 25076 25076 25076
mean 12537.5 27.8713 27.8713
std 7238.96 122.729 122.729
min 0 2 2
25% 6268.75 3 3
50% 12537.5 6 6
75% 18806.2 16 16
max 25075 6437 6437

Table 3. RAW_Interactions Raw Data Description

Rows: 1,132,367
Columns: 5
review: 169 null values
Variable Name | Data type | Variable Category | Description
user_id | integer | nominal | User ID
recipe_id | integer | nominal | Recipe ID
date | string | string | Date of interaction
rating | integer | nominal | Rating given
review | string | string | Review text
user_id recipe_id rating
count 1.13237e+06 1.13237e+06 1.13237e+06
mean 1.38429e+08 160897 4.41102
std 5.01427e+08 130399 1.26475
min 1533 38 0
25% 135470 54257 4
50% 330937 120547 5
75% 804550 243852 5
max 2.00237e+09 537716 5

Table 4. RAW_Recipes Raw Data Description

Rows: 231,637
Columns: 12
name: 1 null value
description: 4,979 null values
Variable Name | Data type | Variable Category | Description
name | string | string | Recipe name
id | integer | nominal | Recipe ID
minutes | integer | nominal | Minutes to prepare recipe
contributor_id | integer | nominal | User ID who submitted this recipe
submitted | string | string | Date recipe was submitted
tags | object | list(string) | Food.com tags for recipe
nutrition | object | list(string) | Nutrition information
n_steps | integer | nominal | Number of steps in recipe
steps | object | list(string) | Text for recipe steps, in order
description | string | string | User-provided description
ingredients | object | list(string) | List of ingredients in recipe
n_ingredients | integer | nominal | Number of ingredients in recipe
id minutes contributor_id n_steps n_ingredients
count 231637 231637 231637 231637 231637
mean 222015 9398.55 5.53489e+06 9.7655 9.05115
std 141207 4.46196e+06 9.97914e+07 5.99513 3.7348
min 38 0 27 0 1
25% 99944 20 56905 6 6
50% 207249 40 173614 9 9
75% 333816 65 398275 12 11
max 537716 2.14748e+09 2.00229e+09 145 43

Table 5. Interactions_test Raw Data Description

Rows: 12,455
Columns: 6
No Null Values
Variable Name | Data type | Variable Category | Description
user_id | string | nominal | User ID
recipe_id | integer | nominal | Recipe ID
date | string | string | Date of interaction
rating | float | nominal | Rating given
u | integer | nominal | User ID, mapped to contiguous integers from 0
i | integer | nominal | Recipe ID, mapped to contiguous integers from 0
user_id recipe_id rating u i
count 12455 12455 12455 12455 12455
mean 2.91269e+07 209323 4.21309 12288.5 115488
std 2.33436e+08 135002 1.3385 6897.75 50448.7
min 1533 120 0 2 102
25% 169842 94616 4 6428.5 76904
50% 382954 195040 5 12023 127793
75% 801637 314928 5 17985.5 160024
max 2.00225e+09 537716 5 25074 178264

Table 6. Interactions_train Raw Data Description

Rows: 698,901
Columns: 6
No Null Values
Variable Name | Data type | Variable Category | Description
user_id | string | nominal | User ID
recipe_id | integer | nominal | Recipe ID
date | string | string | Date of interaction
rating | float | nominal | Rating given
u | integer | nominal | User ID, mapped to contiguous integers from 0
i | integer | nominal | Recipe ID, mapped to contiguous integers from 0
user_id recipe_id rating u i
count 698901 698901 698901 698901 698901
mean 1.24769e+07 156173 4.57409 4249.33 87519.3
std 1.52503e+08 126595 0.959022 5522.6 51290.4
min 1533 38 0 0 0
25% 105988 53169 4 455 42988
50% 230102 116484 5 1737 87424
75% 480195 234516 5 5919 131731
max 2.00231e+09 537458 5 25075 178262

Table 7. Interactions_validation Raw Data Description

Rows: 7,023
Columns: 6
No Null Values
Variable Name | Data type | Variable Category | Description
user_id | string | nominal | User ID
recipe_id | integer | nominal | Recipe ID
date | string | string | Date of interaction
rating | float | nominal | Rating given
u | integer | nominal | User ID, mapped to contiguous integers from 0
i | integer | nominal | Recipe ID, mapped to contiguous integers from 0
user_id recipe_id rating u i
count 7023 7023 7023 7023 7023
mean 1.94779e+07 206406 4.23281 10298 100122
std 1.90469e+08 135238 1.30291 6709.5 52051.1
min 1533 120 0 5 144
25% 159119 89851.5 4 4569.5 56227
50% 352834 192146 5 9248 104819
75% 737332 311632 5 15637.5 146690
max 2.00223e+09 536464 5 25055 178263

Final Data Description

Table 8. User Rating Data Description

Matrix Type: Sparse
Rows: 927 Users
Columns: 4557 Recipe Profiles
No Null Values

Table 9. Recipe Profile Data Description

Matrix Type: Sparse
Rows: 4557 Profiles
Columns: 50 Features
No Null Values

Loading Recipe Data

Out[4]:
name minutes contributor_id submitted tags nutrition n_steps steps description ingredients n_ingredients
id
137739 arriba baked winter squash mexican style 55 47892 2005-09-16 ['60-minutes-or-less', 'time-to-make', 'course... [51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0] 11 ['make a choice and proceed with recipe', 'dep... autumn is my favorite time of year to cook! th... ['winter squash', 'mexican seasoning', 'mixed ... 7
31490 a bit different breakfast pizza 30 26278 2002-06-17 ['30-minutes-or-less', 'time-to-make', 'course... [173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0] 9 ['preheat oven to 425 degrees f', 'press dough... this recipe calls for the crust to be prebaked... ['prepared pizza crust', 'sausage patty', 'egg... 6
112140 all in the kitchen chili 130 196586 2005-02-25 ['time-to-make', 'course', 'preparation', 'mai... [269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0] 6 ['brown ground beef in large pot', 'add choppe... this modified version of 'mom's' chili was a h... ['ground beef', 'yellow onions', 'diced tomato... 13
59389 alouette potatoes 45 68585 2003-04-14 ['60-minutes-or-less', 'time-to-make', 'course... [368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0] 11 ['place potatoes in a large pot of lightly sal... this is a super easy, great tasting, make ahea... ['spreadable cheese with garlic and herbs', 'n... 11
44061 amish tomato ketchup for canning 190 41706 2002-10-25 ['weeknight', 'time-to-make', 'course', 'main-... [352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0] 5 ['mix all ingredients& boil for 2 1 / 2 hours ... my dh's amish mother raised him on this recipe... ['tomato juice', 'apple cider vinegar', 'sugar... 8

Table 10. Unprocessed Recipe Information Dataset Sample

Loading User Rating Data

Out[5]:
recipe_id date rating review
user_id
38094 40893 2003-02-17 4 Great with a salad. Cooked on top of stove for...
1293707 40893 2011-12-21 5 So simple, so delicious! Great for chilly fall...
8937 44394 2002-12-01 4 This worked very well and is EASY. I used not...
126440 85009 2010-02-27 5 I made the Mexican topping and took it to bunk...
57222 85009 2011-10-01 5 Made the cheddar bacon topping, adding a sprin...

Table 11. Unprocessed User Rating Dataset Sample

Data Pre-Processing

Several pre-processing steps were undertaken to achieve the desired item profile matrix and utility matrix for the recommender system:

  • The item profile matrix was scaled per feature.
  • The utility matrix was mean-centered per user to factor out differences in how users perceive a 3-star versus a 4-star rating.
  • Only data from 2008 onwards was included.
  • Only quick desserts (under 90 minutes to make) were included.
  • Only users with more than 5 ratings were included.
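The filtering and mean-centering steps above can be sketched with pandas on toy data. The column names mirror the dataset's; the values below are made up for illustration only:

```python
import pandas as pd

# Toy stand-ins for the raw interaction and recipe tables
ratings = pd.DataFrame({
    "user_id":   [1, 1, 1, 1, 1, 1, 1, 2, 2],
    "recipe_id": [10, 11, 12, 13, 14, 16, 17, 10, 11],
    "date":      pd.to_datetime(["2009-03-01"] * 9),
    "rating":    [5, 4, 5, 3, 4, 5, 4, 5, 2],
})
recipes = pd.DataFrame({
    "id":      [10, 11, 12, 13, 14, 16, 17],
    "minutes": [30, 45, 120, 60, 15, 20, 25],
})

# Keep interactions from 2008 onwards
ratings = ratings[ratings["date"].dt.year >= 2008]

# Keep only quick desserts (< 90 minutes to make)
quick_ids = set(recipes.loc[recipes["minutes"] < 90, "id"])
ratings = ratings[ratings["recipe_id"].isin(quick_ids)]

# Keep only users with more than 5 ratings
counts = ratings.groupby("user_id")["rating"].transform("count")
ratings = ratings[counts > 5]

# Build the utility matrix and mean-center it per user
utility = ratings.pivot_table(index="user_id", columns="recipe_id",
                              values="rating")
centered = utility.sub(utility.mean(axis=1), axis=0)
```

After these filters, only users with a meaningful rating history remain, and each surviving user's ratings are expressed relative to their own average.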
Out[7]:
recipe_id 124 128 281 345 352 355 387 407 520 779 ... 515888 516300 517744 519541 521251 527687 528546 532655 532736 532740
user_id
1535 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4291 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4439 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4470 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4740 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1803786474 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2000431901 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2001102678 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2001362355 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2001453193 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

974 rows × 5708 columns

Table 12. Final User Profile Dataset

Out[8]:
minutes n_steps calories total_fat sugar_pvd sodium_pvd protein_pvd sat_fat_pvd carbs_pvd all-purpose flour ... shortening sour cream sugar sweetened condensed milk vanilla vanilla extract vegetable oil walnuts water white sugar
id
23933 15 4 232.7 21.0 77.0 4.0 6.0 38.0 8.0 0.000000 ... 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
67664 10 3 164.6 3.0 5.0 1.0 4.0 6.0 11.0 0.000000 ... 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
40237 40 7 2252.6 130.0 802.0 101.0 44.0 54.0 117.0 0.000000 ... 0.0 0.0 0.228796 0.0 0.280123 0.000000 0.499522 0.000000 0.362322 0.000000
124286 35 5 3987.2 245.0 1306.0 260.0 106.0 369.0 205.0 0.000000 ... 0.0 0.0 0.246249 0.0 0.301491 0.000000 0.000000 0.000000 0.389961 0.000000
118843 20 3 5286.9 427.0 1630.0 159.0 163.0 656.0 224.0 0.000000 ... 0.0 0.0 0.000000 0.0 0.241829 0.000000 0.000000 0.000000 0.000000 0.408645
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
26525 55 7 305.9 26.0 102.0 8.0 7.0 12.0 11.0 0.000000 ... 0.0 0.0 0.211417 0.0 0.258845 0.000000 0.461579 0.414826 0.000000 0.000000
243190 90 7 569.5 51.0 139.0 19.0 14.0 23.0 20.0 0.000000 ... 0.0 0.0 0.212637 0.0 0.260339 0.000000 0.000000 0.000000 0.000000 0.000000
18693 80 7 310.2 25.0 96.0 10.0 7.0 28.0 13.0 0.000000 ... 0.0 0.0 0.193680 0.0 0.237129 0.000000 0.422854 0.380023 0.000000 0.000000
253705 30 5 175.2 13.0 65.0 3.0 4.0 25.0 8.0 0.000000 ... 0.0 0.0 0.000000 0.0 0.000000 0.317115 0.000000 0.000000 0.000000 0.000000
386631 40 7 147.1 7.0 69.0 9.0 5.0 3.0 8.0 0.443651 ... 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

5708 rows × 50 columns

Table 13. Final Item Profile Dataset

Insights from the Data

The most common ingredients in desserts are Brown Sugar and Baking Soda.

Figure 3. Wordcloud of the Ingredients of the Desserts

The most common dessert names are those with Chocolate and Apple, while Cake and Cookie are the most common dessert types.

Figure 4. Wordcloud of the Dessert Names

10,012 ratings are 5.0, which is 76% of the total rows.

Figure 5. Histogram of the Dessert Ratings

3,372 items have only one rating, which is 59% of the total dataset.

Figure 6. Histogram of the Dessert Rating Count

Figure 7. Bar Plot of the Desserts with Highest Number of Ratings

Methodology

Data Pre-Processing.png

Figure 8. Overview of the Methodology

Methodology Pipeline

1. Raw Data Exploration

  • Examine data structure and characteristics
  • Check for null data
  • Check for unwanted outliers

2. Data Cleaning and Preprocessing

  • Perform null value imputation
  • Perform removal of duplicate data
  • Filter the user data to:
    • Users who rated from 2009 to 2018
    • Users who rated 5 or more items
  • Filter the recipe data to:
    • Recipes that have been rated by the selected users.

3. Data Vectorization

  • Vectorize recipe data using TF-IDF vectorizer to convert the recipes into features.

4. Dimensionality Reduction

  • Perform dimensionality reduction by Singular Value Decomposition (SVD) on user data.
  • Perform dimensionality reduction by Singular Value Decomposition (SVD) on recipe data.

5. Clustering

  • Perform various clustering techniques on user data.
    • Representative-based clustering
      • K-means
      • K-medoids
    • Agglomerative clustering
      • Single Linkage
      • Average Linkage
      • Complete Linkage
      • Ward's Linkage
    • Density-based clustering
      • DBSCAN
      • OPTICS
    • Probabilistic clustering
      • Gaussian Mixture
  • Perform various clustering techniques on recipe data.
    • Representative-based clustering
      • K-means
      • K-medoids
    • Agglomerative clustering
      • Single Linkage
      • Average Linkage
      • Complete Linkage
      • Ward's Linkage
    • Density-based clustering
      • DBSCAN
      • OPTICS
    • Probabilistic clustering
      • Gaussian Mixture

6. Recommender System

  • Create a recommender system using various collaborative filtering methods.
    • User-based Collaborative Filtering
    • Item-based Collaborative Filtering
    • Latent Factor-based Collaborative Filtering
  • Generate user-cluster-specific recommendations
    • For Cluster 0
    • For Cluster 1
  • Generate item-cluster-specific recommendations
    • For Cluster 0
    • For Cluster 1
  • Generate recommendations based on user preference
  • Compare evaluation metrics
    • Error Score (RMSE, MSE)
    • Coverage Score
    • Novelty Score
    • Personalization Score
    • Intra-list Similarity Score
    • Metrics Radar Plot Summary
  • Generate recommendations for clustered users and items
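As a rough illustration of the latent-factor method in step 6, a truncated SVD of the mean-centered utility matrix yields a low-rank reconstruction whose entries act as predicted scores for unrated recipes. This is a minimal NumPy sketch on a toy matrix, not the project's actual implementation:

```python
import numpy as np

# Toy mean-centered utility matrix (users x recipes); 0.0 marks "not rated"
R = np.array([
    [ 0.5, -0.5,  0.0,  0.0],
    [ 1.0,  0.0, -1.0,  0.0],
    [ 0.0,  0.8,  0.0, -0.8],
])

# Keep k latent factors from the SVD
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]  # rank-k reconstruction

# The reconstructed entry is the predicted (mean-centered) score for
# user 0 on recipe 2, which user 0 has not rated
prediction = R_hat[0, 2]
```

Adding back each user's mean rating turns these centered predictions into rating estimates, which can then be ranked to produce a recommendation list.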

Main Body

Although it is possible to cluster the data using various clustering techniques at the outset, it is highly recommended to perform dimensionality reduction first to reduce computational complexity. It may be possible to get decent results without dimensionality reduction, but not for high-dimensional data: distances become less meaningful as the number of dimensions grows. This is what we call the curse of dimensionality, a phenomenon that leads to poor clustering performance.

In this specific context, we apply Singular Value Decomposition (SVD) to both the user data and the recipe data. The choice of dimensionality reduction technique depends on the nature of the data; here, SVD is the more fitting choice because the data is sparse.
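This step can be sketched with scikit-learn's TruncatedSVD. Random data stands in for the real matrices below, and the 80% threshold matches the one used in the analysis that follows:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))  # stand-in for a user or item matrix

# Fit more components than we expect to keep
svd = TruncatedSVD(n_components=40, random_state=0)
X_reduced = svd.fit_transform(X)

# Smallest number of singular vectors whose cumulative explained
# variance reaches 80%
cum = np.cumsum(svd.explained_variance_ratio_)
n_keep = int(np.searchsorted(cum, 0.80) + 1)
```

TruncatedSVD also accepts scipy sparse matrices directly, which is why it suits the sparse utility and item-profile matrices used here.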

Dimensionality Reduction

We then check how many singular vectors to retain in order to preserve at least 80% of the information, as measured by the cumulative variance explained. Refer to the plot below.

Users Ratings

Figure 9. Line Graph of the Explained Variance per Singular Vectors Retained for User Profile

By plotting the first two singular vectors, we can visualize what the data looks like in two-dimensional space. Normally, the first two singular vectors would contain most of the information, but in this case they represent only approximately 2% of the information.

Figure 10. Scatter Plot of the User Profile Projected to the First Two Singular Vectors

Next, we plot the most important features of the first five singular vectors. By doing so, we can visualize the contributions of each feature to the overall information contained in the dataset.

Figure 11. Bar Plot of the Top 20 Important User Features of the First Five Singular Vectors

Recipe

Again, we perform Singular Value Decomposition, but this time we apply it to the recipe dataset. We again retain at least 80% of the information, which corresponds to about 31 singular vectors.

Figure 12. Line Graph of the Explained Variance per Singular Vectors Retained for Item Profile

For visualization purposes, we plot the first two singular vectors, which capture about 17% of the information. Again, this plot does not reflect the true nature of the data.

Figure 13. Scatter Plot of the Item Profile Projected to the First Two Singular Vectors

Figure 14. Bar Plot of the Top 20 Important Item Features of the First Five Singular Vectors

Quick checkpoint: let us summarize why dimensionality reduction is necessary:

  1. Curse of Dimensionality
  2. Computational Complexity
  3. Redundant Features

Dimensionality reduction helps address these issues by reducing the number of dimensions in the data while preserving the most important information. In general, this improves the performance of clustering algorithms by:

  1. Reducing computational complexity
  2. Improving the quality of clustering
  3. Making it easier to identify the most important features of the data

Clustering the Users

Now that we have performed dimensionality reduction, we can cluster the data expecting higher-quality clusters. Note that in selecting the optimal number of clusters, we must rely on internal validation scores rather than the plot visualization. The plot represents only a relatively small share of the information, so judging the clusters visually can lead to poor conclusions.

For clustering, we have decided to perform the following techniques. (Note: only three clustering techniques are shown in the main body; the others have been moved to the appendices.)

1. Representative-based Clustering

  • K-means
  • K-medoids

2. Hierarchical Clustering

  • Single Linkage
  • Average Linkage
  • Complete Linkage
  • Ward's Linkage

3. Density-based Clustering

  • Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
  • Ordering points to identify the clustering structure (OPTICS)

4. Probabilistic Clustering

  • Gaussian Mixture

First, we perform various clustering algorithms on the user dataset. Please refer to Exhibit 1 for the other clustering techniques performed on the user dataset.

K-means Clustering

For our clustering technique, we ultimately decided to use K-means clustering as it provides the following advantages:

  1. Highly interpretable
  2. Efficient (Fast run time)
  3. Simple
  4. Scalable
  5. Robust

We performed hyperparameter tuning using a simple grid search, clustering over a range of values to find the best number of clusters.

Here is a summary of the scores:

  1. SSE - We want to look for the elbow. Inconclusive, all scores are approximately the same.
  2. Calinski-Harabasz Index - We want the highest score. Inconclusive, all scores are approximately the same.
  3. Silhouette Coefficient - We want a score closest to 0.5. Inconclusive, all scores are approximately the same.
  4. Davies-Bouldin Index - We want a score that is close to zero, but not zero. k=2 has a DB index closest to zero.

Using the DB index, we identified that our optimal number of clusters is k=2.
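The grid search described above can be sketched as follows, using synthetic blobs in place of the SVD-reduced user matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

# Two well-separated synthetic blobs stand in for the real user data
X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = {
        "SSE": km.inertia_,
        "CH": calinski_harabasz_score(X, km.labels_),
        "Silhouette": silhouette_score(X, km.labels_),
        "DB": davies_bouldin_score(X, km.labels_),
    }

# Pick the k whose Davies-Bouldin index is closest to zero
best_k = min(scores, key=lambda k: scores[k]["DB"])
```

On data with two clear groups, this selection lands on k=2, mirroring the procedure used above; on the real data, the same loop produces the metric curves shown in the figures.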

Figure 15. User Profile Scatter Plot of the KMeans Clustering with the Plot of Internal Validation Metrics for Each Iteration

K-medoids Clustering

Figure 16. User Profile Scatter Plot of the KMedoid Clustering with the Plot of Internal Validation Metrics for Each Iteration

Gaussian Mixture Clustering

Figure 17. User Profile Scatter Plot of the Gaussian Mixture Clustering for Each Iteration

Figure 18. User Profile Line Plot of the Silhouette Score for Each Iteration of Gaussian Mixture Clustering

Figure 19. User Profile Line Plot of the BIC Score for Each Iteration of Gaussian Mixture Clustering

Figure 20. User Profile Line Plot of the Gradient of BIC Score for Each Iteration of Gaussian Mixture Clustering

Analyzing the Optimal Cluster for User

As an extra step, we analyze the result of our clustering at the optimal number of clusters, k=2. Here are our observations:

Cluster 0

  • 884 samples
  • 4.54 mean rating
  • 0.93 standard deviation
  • 2.12 mean rating count

Cluster 1

  • 90 samples
  • 4.48 mean rating
  • 1.10 standard deviation
  • 0.17 mean rating count

For the other observations, refer to the information found in the tables and plots below.

Figure 21. User Profile Scatter Plot of Optimal Clustering with KMeans

Out[28]:
sample_count mean_rating_count mean_rating std_rating
0 884 2.123511 4.535502 0.927936
1 90 0.172915 4.481924 1.096093

Table 14. Comparison Between the Two Clusters for User Profile
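A comparison table like the one above can be produced with a pandas groupby aggregation. Below is a minimal sketch on made-up per-user summaries (the real pipeline would derive these columns from the utility matrix and the K-means labels):

```python
import pandas as pd

# Hypothetical per-user summaries with K-means cluster labels
users = pd.DataFrame({
    "cluster":      [0, 0, 0, 1, 1],
    "mean_rating":  [4.5, 4.6, 4.5, 4.4, 4.6],
    "rating_count": [3.0, 2.0, 1.5, 0.2, 0.1],
})

# One row per cluster: cluster size and average rating behavior
summary = users.groupby("cluster").agg(
    sample_count=("mean_rating", "size"),
    mean_rating_count=("rating_count", "mean"),
    mean_rating=("mean_rating", "mean"),
)
```

The named-aggregation form keeps the output columns labeled the same way as in the comparison table.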

Cluster 0

Out[29]:
item user_count mean_rating
0 vanilla buttercream frosting from sprinkles ... 52 4.653846
1 reeses squares 5 ingredients no bake reese s 48 4.812500
2 my no roll pie crust 47 4.574468
3 edna s apple crumble aka apple crisp 45 4.577778
4 deb s favorite way to eat fresh fruit 39 4.897436
5 kittencal s bakery buttercream frosting icing 36 4.694444
6 big grandma s best peanut butter cookies 34 4.705882
7 soft batch chocolate chip cookies 32 4.437500
8 most incredible no fail pie crust 31 4.677419
9 the best peanut butter oatmeal cookies 29 4.413793

Table 15. Overview for Cluster 0 of the User Profile

Cluster 1

Out[30]:
item user_count mean_rating
0 kittencal s chocolate frosting icing 90 4.911111
1 kittencal s bakery buttercream frosting icing 12 4.750000
2 vanilla buttercream frosting from sprinkles ... 12 4.833333
3 my no roll pie crust 12 3.666667
4 most incredible no fail pie crust 8 3.875000
5 quick yellow cake 8 4.000000
6 peanut butter chocolate chunk cookies 8 4.750000
7 big grandma s best peanut butter cookies 7 4.714286
8 edna s apple crumble aka apple crisp 6 4.000000
9 1 pan fudge cake 6 4.666667

Table 16. Overview for Cluster 1 of the User Profile

Looking at the plots, we can characterize the clusters as follows:

Cluster 0

  1. Relatively high rating counts
  2. High number of data points

Cluster 1

  1. Relatively low rating counts
  2. Low number of data points

Based on this information, we can then label our clusters.

  • Cluster 0 - Standard Users
  • Cluster 1 - New Users

Clustering the Items

We also cluster the recipe data following the same clustering method we applied to the user data. Again, relying on visualizations alone can be misleading; instead, we choose the optimal number of clusters using the internal validation criteria.

Please refer to Exhibit 2 for the other clustering techniques performed on the recipe dataset.

K-means Clustering

We again performed hyperparameter tuning using a simple grid search, clustering over a range of values to find the best number of clusters.

Here is a summary of the scores:

  1. SSE - We want to look for the elbow. Inconclusive, all scores are approximately the same.
  2. Calinski-Harabasz Index - We want the highest score. Inconclusive, all scores are approximately the same.
  3. Silhouette Coefficient - We want a score closest to 0.5. k=2 has a score closest to 0.5
  4. Davies-Bouldin Index - We want a score that is close to zero, but not zero. k=2 has a DB index closest to zero.

Using the DB index and Silhouette coefficient, we identified that our optimal number of clusters is k=2.
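The grid search above can be sketched as follows. The synthetic blobs, the k range, and the metric choices here are illustrative stand-ins for the actual recipe features, not the notebook's exact code:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Two well-separated synthetic blobs stand in for the reduced item features.
X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=0)

results = {}
for k in range(2, 8):  # grid search over the number of clusters
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = {
        "sse": km.inertia_,                                   # elbow method
        "silhouette": silhouette_score(X, km.labels_),        # silhouette coefficient
        "davies_bouldin": davies_bouldin_score(X, km.labels_) # DB index
    }

# Pick the k with the lowest Davies-Bouldin index.
best_k = min(results, key=lambda k: results[k]["davies_bouldin"])
print(best_k)
```

On data with two clear clusters, the DB index is minimized at k=2, mirroring the result above.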

Figure 22. Item Profile Scatter Plot of the KMeans Clustering with the Plot of Internal Validation Metrics for Each Iteration

K-medoids Clustering

Figure 23. Item Profile Scatter Plot of the KMedoid Clustering with the Plot of Internal Validation Metrics for Each Iteration

Gaussian Mixture Clustering

Figure 24. Item Profile Scatter Plot for Each Iteration of Gaussian Mixture Clustering

Figure 25. Item Profile Line Plot of the Silhouette Score for Each Iteration of Gaussian Mixture Clustering

Figure 26. Item Profile Line Plot of the BIC Score for Each Iteration of Gaussian Mixture Clustering

Figure 27. Item Profile Line Plot of the Gradient of BIC Score for Each Iteration of Gaussian Mixture Clustering

Analyzing the Optimal Cluster for Items

Now that we have identified the optimal number of clusters at k=2, let us further explore the characteristics of our clusters by performing a simple analysis. For each cluster, let us take the mean of each feature. This will give us a grasp of the characteristics of the data points contained within each cluster. Here are our observations:

Cluster 0

  1. Average calories is approximately 332 (Low-calorie dessert)
  2. Average sugar content is approximately 126 (Low sugar dessert)
  3. Average carbohydrate content is approximately 15 (Low carb dessert)

Cluster 1

  1. Average calories is approximately 3,551 (High-calorie dessert)
  2. Average sugar content is approximately 1,333 (High sugar dessert)
  3. Average carbohydrate content is approximately 157 (High carb dessert)

From this, we can infer that cluster 1 comprises high-calorie, nutrient-dense desserts, whereas cluster 0 comprises lighter, low-calorie desserts. Of course, this method is not entirely reliable, as we are only looking at the raw values taken from the mean of each cluster. Hence, in the next step, we will perform hypothesis testing to identify the significant features of each cluster.
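The per-cluster means above boil down to a single `groupby` over the cluster labels. A minimal sketch with made-up numbers (the real frame has 50 columns):

```python
import pandas as pd

# Hypothetical mini-frame; the cluster column holds the K-means assignments.
df = pd.DataFrame({
    "cluster":   [0, 0, 0, 1, 1],
    "calories":  [300.0, 350.0, 340.0, 3400.0, 3700.0],
    "sugar_pvd": [120.0, 130.0, 125.0, 1300.0, 1370.0],
})

profile = df.groupby("cluster").mean()  # one row of feature means per cluster
print(profile)
```

Each row of `profile` summarizes one cluster, which is exactly what Table 17 shows for the full feature set.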

Figure 28. Item Profile Scatter Plot of Optimal Clustering with KMeans

Out[38]:
minutes n_steps calories total_fat sugar_pvd sodium_pvd protein_pvd sat_fat_pvd carbs_pvd all-purpose flour ... shortening sour cream sugar sweetened condensed milk vanilla vanilla extract vegetable oil walnuts water white sugar
cluster
0 27.941318 5.254844 331.798819 22.916221 126.231039 7.581473 8.950729 34.458203 15.018638 0.025838 ... 0.013848 0.019654 0.128239 0.020571 0.078402 0.053085 0.017715 0.025228 0.054689 0.018798
1 37.702422 5.307958 3551.284429 271.453287 1333.211073 104.422145 82.352941 356.038062 156.837370 0.047068 ... 0.044164 0.026814 0.131160 0.029583 0.124157 0.052186 0.027115 0.025177 0.039620 0.034243

2 rows × 50 columns

Table 17. Comparison Between the Two Clusters for Item Profile

Cluster Explainability

If you've taken an advanced statistics class, chances are your professor mentioned something along these lines:

"In statistics, we do NOT eyeball"

Hypothesis testing is an essential statistical tool used to make decisions about a population based on sample data. It allows us to test whether an observed effect is statistically significant or whether it could have occurred by chance.

Our chosen statistical test is the SFIT method. The Single Feature Introduction Test (SFIT) is a simple and computationally efficient significance test for the features of a machine learning model. It identifies the statistically significant features as well as feature interactions of any order in a hierarchical manner. [14][15]

Training a 3-Hidden-Layer Neural Network and Performing the Single Feature Introduction Test (SFIT) for Explainable Clustering

Before we can perform classification, we first labeled each recipe using its cluster assignment from our chosen clustering algorithm (K-means, at k=2).

To perform classification on the clusters, we trained a 3-hidden-layer neural network with ReLU activation functions and hidden sizes of 100, 50, and 25. The network is trained for at most 50 epochs using the Adam optimizer.

After running the classification, we perform the Single Feature Introduction Test (SFIT) on the trained network, using only the data for a specific cluster and returning that cluster's most important features.
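The core idea of SFIT can be sketched like this: score the trained classifier with every feature held at a baseline value and a single feature "introduced" at a time. This toy version uses scikit-learn's MLPClassifier and synthetic data as stand-ins for the notebook's Keras network and recipe features, and omits SFIT's confidence intervals and interaction terms:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the cluster-labeled recipe features.
X, y = make_classification(n_samples=400, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)

# Three hidden layers, mirroring the (100, 50, 25) architecture above.
clf = MLPClassifier(hidden_layer_sizes=(100, 50, 25), activation="relu",
                    solver="adam", max_iter=300, random_state=0).fit(X, y)

# Baseline: every feature frozen at its median.
baseline = np.tile(np.median(X, axis=0), (len(X), 1))
importance = {}
for j in range(X.shape[1]):
    probe = baseline.copy()
    probe[:, j] = X[:, j]                 # introduce feature j on its own
    importance[j] = clf.score(probe, y)   # accuracy with only j varying

# Features whose introduction recovers the most accuracy matter most.
top = max(importance, key=importance.get)
```

Features whose solo introduction recovers the most predictive accuracy are ranked as the most important, analogous to the median scores in the SFIT tables below.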

Total number of observations: 5708
Number of features: 50
Cluster data:
Cluster 1 has 5418 samples
Cluster 0 has 290 samples
Prepare data for classification:
Number of observations in train set: 4051
Number of observations in validation set: 1129
Number of observations in test set: 528
Total number of observations: 5708
Fit neural network
36/36 [==============================] - 0s 1ms/step
Neural network accuracy on val set: 0.99 

Neural network bal acc on val set: 0.96 

17/17 [==============================] - 0s 1ms/step
Neural network acc on test set: 0.99 

Neural network bal acc on test set: 0.95 

Overall SFIT analysis:
                      variable    median  CI_lower_bound  CI_upper_bound
22                        eggs  0.022137        0.022137        0.022137
40                     vanilla  0.019781        0.019781        0.019781
2                     calories  0.019462        0.018577        0.021031
0                      minutes  0.017803        0.015291        0.018623
4                    sugar_pvd  0.017274        0.015810        0.018535
8                    carbs_pvd  0.013860        0.013860        0.015011
10               baking powder  0.013095        0.013095        0.013095
3                    total_fat  0.012675        0.011419        0.013892
1                      n_steps  0.011500        0.011500        0.011500
38                       sugar  0.010403        0.010403        0.010403
12                     bananas  0.010029        0.010029        0.010029
5                   sodium_pvd  0.009853        0.008771        0.010999
21                         egg  0.009477        0.009477        0.009477
6                  protein_pvd  0.008926        0.008926        0.010319
27                 lemon juice  0.008429        0.008429        0.008429
30                        nuts  0.008264        0.008264        0.008264
44                 white sugar  0.008068        0.008068        0.008068
37                  shortening  0.008010        0.008010        0.008010
9            all-purpose flour  0.007555        0.007555        0.007555
39    sweetened condensed milk  0.007148        0.007148        0.007148
14             chocolate chips  0.006328        0.006328        0.006328
18        confectioners' sugar  0.006210        0.006210        0.006210
43                     walnuts  0.006059        0.006059        0.006059
29                      nutmeg  0.005826        0.005826        0.005826
15                    cinnamon  0.005632        0.005632        0.005632
17                     coconut  0.005387        0.005387        0.005387
36  semi-sweet chocolate chips  0.005385        0.005385        0.005385
16                       cocoa  0.004856        0.004856        0.004856
28                        milk  0.004668        0.004668        0.004668
23                       flour  0.004205        0.004205        0.004205
26                       honey  0.004177        0.004177        0.004177
7                  sat_fat_pvd  0.004089        0.003498        0.004792
24            granulated sugar  0.004049        0.004049        0.004049
42               vegetable oil  0.004008        0.004008        0.004008
11                 baking soda  0.003861        0.003861        0.003861
31                         oil  0.003582        0.003582        0.003582
35                        salt  0.003333        0.003333        0.003333
41             vanilla extract  0.002894        0.002894        0.002894
19                   cool whip  0.002416        0.002416        0.002416
33              powdered sugar  0.002181        0.002181        0.002181
34                     raisins  0.002085        0.002085        0.002085
32                      pecans  0.001778        0.001778        0.001778
25             ground cinnamon  0.001473        0.001473        0.001473
20                cream cheese  0.000710        0.000710        0.000710
13                 brown sugar  0.000460        0.000460        0.000460

SFIT analysis per cluster:

Cluster 1
                      variable    median  CI_lower_bound  CI_upper_bound
22                        eggs  0.022137        0.022137        0.022137
40                     vanilla  0.019781        0.019781        0.019781
2                     calories  0.019445        0.019057        0.019822
0                      minutes  0.018623        0.018623        0.018623
4                    sugar_pvd  0.016848        0.016331        0.017359
8                    carbs_pvd  0.015011        0.013860        0.015011
3                    total_fat  0.013287        0.012675        0.013287
10               baking powder  0.013095        0.013095        0.013095
1                      n_steps  0.011500        0.011500        0.011500
38                       sugar  0.010403        0.010403        0.010403
12                     bananas  0.010029        0.010029        0.010029
5                   sodium_pvd  0.009853        0.009853        0.009853
21                         egg  0.009477        0.009477        0.009477
6                  protein_pvd  0.008926        0.008926        0.008926
27                 lemon juice  0.008429        0.008429        0.008429
30                        nuts  0.008264        0.008264        0.008264
44                 white sugar  0.008068        0.008068        0.008068
37                  shortening  0.008010        0.008010        0.008010
9            all-purpose flour  0.007555        0.007555        0.007555
39    sweetened condensed milk  0.007148        0.007148        0.007148
14             chocolate chips  0.006328        0.006328        0.006328
18        confectioners' sugar  0.006210        0.006210        0.006210
43                     walnuts  0.006059        0.006059        0.006059
29                      nutmeg  0.005826        0.005826        0.005826
15                    cinnamon  0.005632        0.005632        0.005632
17                     coconut  0.005387        0.005387        0.005387
36  semi-sweet chocolate chips  0.005385        0.005385        0.005385
16                       cocoa  0.004856        0.004856        0.004856
28                        milk  0.004668        0.004668        0.004668
7                  sat_fat_pvd  0.004556        0.004323        0.004558
23                       flour  0.004205        0.004205        0.004205
26                       honey  0.004177        0.004177        0.004177
24            granulated sugar  0.004049        0.004049        0.004049
42               vegetable oil  0.004008        0.004008        0.004008
11                 baking soda  0.003861        0.003861        0.003861
31                         oil  0.003582        0.003582        0.003582
35                        salt  0.003333        0.003333        0.003333
41             vanilla extract  0.002894        0.002894        0.002894
19                   cool whip  0.002416        0.002416        0.002416
33              powdered sugar  0.002181        0.002181        0.002181
34                     raisins  0.002085        0.002085        0.002085
32                      pecans  0.001778        0.001778        0.001778
25             ground cinnamon  0.001473        0.001473        0.001473
20                cream cheese  0.000710        0.000710        0.000710
13                 brown sugar  0.000460        0.000460        0.000460

Cluster 0
        variable    median  CI_lower_bound  CI_upper_bound
0       calories  1.840302        1.665860        1.989166
3     sodium_pvd  0.984258        0.875044        1.152583
5      carbs_pvd  0.823538        0.818391        0.826876
1      total_fat  0.769799        0.767124        0.777296
2      sugar_pvd  0.388514        0.377369        0.418586
4    protein_pvd  0.278332        0.275677        0.282281
7         butter  0.035504        0.035504        0.035504
9          water  0.033026        0.033026        0.033026
8  peanut butter  0.002266        0.002266        0.002266
6    baking soda  0.001014        0.001014        0.001014

Here are the five most important features of each cluster:

Cluster 0

  • calories
  • sodium_pvd
  • carbs_pvd
  • total_fat
  • sugar_pvd

Cluster 1

  • eggs
  • vanilla
  • calories
  • minutes
  • sugar_pvd

These features are what separate one cluster from the other. However, the results do not tell us exactly why these features are considered the most important — for instance, whether a feature is abundant or lacking within the cluster.

Since we are comparing only two clusters, it is likely that both clusters share some significant features. In our case, we got calories as a common significant feature. This could mean that, for the calorie feature, both clusters are likely to demonstrate a significant difference in their mean or median.

To augment our results from the SFIT method, we can check the values of each of these features for both clusters. If you remember, earlier we took a look at the mean calorie and sugar content of both clusters, and indeed, these clusters differ substantially on those features.

Recap:

Cluster 0

  1. Average calories is approximately 332 (Low-calorie dessert)
  2. Average sugar content is approximately 126 (Low sugar dessert)
  3. Average carbohydrate content is approximately 15 (Low carb dessert)

Cluster 1

  1. Average calories is approximately 3,551 (High-calorie dessert)
  2. Average sugar content is approximately 1,333 (High sugar dessert)
  3. Average carbohydrate content is approximately 157 (High carb dessert)

Now that we have this information, we can then label our clusters.

  • Cluster 0 - Low-fat Dessert
  • Cluster 1 - High-fat Dessert

Building the Recommender System

This section is the most important part of this study. This section directly answers the problem statement (why this study exists). Let us recap:

What?

  • Create a recommender system for the dessert recipes in food.com to help users decide among the thousands of desserts available on the platform.

Why?

  • Food.com only has a fixed set of recommendations based on the recipe you are currently viewing. The goal is to create a personalized recommendation system that suggests recipes based on user ratings.

Now before we proceed, let us first ask some questions to better understand what recommender systems are and what value they provide to a business.

What is a recommender system?

A recommendation system is a subclass of information filtering systems that seeks to predict the rating or preference a user would give to an item. In simple words, it is an algorithm that suggests relevant items to users. For example: which movie to watch on Netflix, which product to buy on an e-commerce site, or which book to read on Kindle. [16]

Why are recommender systems important?

Recommender systems are important because they help users discover and engage with content or products that are most relevant and interesting to them. In an era of information overload, recommender systems play a critical role in filtering and personalizing content for individual users.

Here are some specific reasons why recommender systems are important:

  1. Personalization
  2. Increased Sales and Revenue
  3. Efficient Use of Resources
  4. Enhanced User Experience

To summarize, recommender systems are powerful business tools that provide personalized recommendations to users. They help users discover new content and products that they may not have found otherwise, while also driving sales and revenue for businesses.

image.png

Figure 29. A simple representation of a recommender system
Source: https://nafeea3000.medium.com/recommender-systems-c8db209dd0d3

There are many types of recommender systems, and we could talk about them all day. However, for this study, we need to help Chef Almond choose the best type of recommender system out of the following:

  1. User-based Collaborative Filtering
  2. Item-based Collaborative Filtering
  3. Latent Factor-based Collaborative Filtering

For this study, we will be using the scikit-surprise library, a popular Python library for building recommender systems with collaborative filtering techniques. The algorithms we decided to use are user-based k-NN, item-based k-NN, and SVD. Here's a comparison of these three algorithms as implemented in scikit-surprise:

  1. User-based kNN
  2. Item-based kNN
  3. SVD (Latent Factor)

After implementing the three aforementioned algorithms, we will then evaluate each model and identify which of the three performs the best.

User-based Collaborative Filtering

This algorithm is an implementation of user-based collaborative filtering using k-NN. It computes similarities between users and uses the k-nearest neighbors to make recommendations. User-based k-NN can be fast and effective for small to medium-sized datasets but can suffer from scalability and sparsity issues for larger datasets.
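The mechanics can be illustrated with a toy NumPy version: mean-center each user's ratings, compute Pearson similarity over co-rated items, then predict from the k most similar users who rated the target item. This is a conceptual sketch, not scikit-surprise's internal implementation, and the ratings matrix is made up:

```python
import numpy as np

# Toy ratings matrix: rows = users, columns = recipes; 0 means "not rated".
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def predict_user_based(R, u, i, k=2):
    """Predict R[u, i] from the k most Pearson-similar users who rated item i."""
    rated = R > 0
    means = np.array([R[v][rated[v]].mean() for v in range(len(R))])
    sims = []
    for v in range(len(R)):
        if v == u or not rated[v, i]:
            continue
        common = rated[u] & rated[v]          # items both users rated
        if common.sum() < 2:
            continue
        a = R[u, common] - means[u]
        b = R[v, common] - means[v]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 0:
            sims.append((a @ b / denom, v))
    top = sorted(sims, reverse=True)[:k]      # k nearest neighbours
    if not top:
        return means[u]
    num = sum(s * (R[v, i] - means[v]) for s, v in top)
    return means[u] + num / sum(abs(s) for s, _ in top)

pred = predict_user_based(R, u=0, i=2)
```

Every prediction requires a pass over all other users, which is why this approach struggles to scale on large, sparse datasets.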

Recommendation Results

Computing the pearson similarity matrix...
Done computing similarity matrix.
1535 4291 4439 4470 4740
Likes Recommends Likes Recommends Likes Recommends Likes Recommends Likes Recommends
Top1 soft batch chocolate chip cookies catherine s excellent yorkshire pudding almond joy fudge brownies catherine s excellent yorkshire pudding simple apple crisp catherine s excellent yorkshire pudding apple raisin sauce for baked ham catherine s excellent yorkshire pudding 5 minute toasted almond cheesecake pie catherine s excellent yorkshire pudding
Top2 easy blueberry lemon parfait caramel apple milkshakes soft and chewy m m cookies caramel apple milkshakes rice krispies bundt cake caramel apple milkshakes papaya yogurt boats caramel apple milkshakes real honest to goodness indoor s mores really caramel apple milkshakes
Top3 frog cupcakes french pecan pie doctored up vanilla pudding casserole french pecan pie toffee almond blondies french pecan pie peppered strawberries french pecan pie sopaipilla cheesecake french pecan pie
Top4 creamsicle float mashed sweet potato pie kittencal s bakery buttercream frosting icing mashed sweet potato pie ms lamorte s apple crisp mashed sweet potato pie fruit and fiber parfait ww friendly 1 point mashed sweet potato pie burger king s hershey s sundae pie mashed sweet potato pie
Top5 perfect sugar cookie glaze banana orange ice cream grandma may s molasses cookies banana orange ice cream fruitcake cookies made easy banana orange ice cream non alcoholic hard sauce banana orange ice cream unknownchef86 s instant pudding ice cream banana orange ice cream

Table 18. Overview of Likes and Recommendations for User-Based Collaborative Filtering

Evaluating the model with MSE and RMSE

MSE:  0.29790148804741673
RMSE:  0.5458035251328235

Evaluating the Model Prediction Coverage

Coverage:  5.38

Item-based Collaborative Filtering

This algorithm is an implementation of item-based collaborative filtering using k-NN. It computes similarities between items and uses the k-nearest neighbors to make recommendations. Item-based k-NN can be faster and more scalable than user-based k-NN and can handle sparsity better. However, it may not work as well for datasets with a large number of items.
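The item-based variant flips the axis of comparison: similarity is computed between item columns, and a user's predicted score is a similarity-weighted average of their own past ratings. A toy sketch (unrated entries are treated as zeros for simplicity; not scikit-surprise's internals):

```python
import numpy as np

# Toy ratings matrix: rows = users, columns = recipes.
R = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [1.0, 2.0, 5.0],
])

M = R.T                                     # rows are now items
norms = np.linalg.norm(M, axis=1, keepdims=True)
S = (M @ M.T) / (norms @ norms.T)           # item-item cosine similarity

# Predict user 0's rating for item 2 from their own ratings of items 0 and 1,
# weighted by how similar those items are to item 2.
rated = [0, 1]
w = S[2, rated]
pred = float(w @ R[0, rated] / w.sum())
```

The item-item similarity matrix can be precomputed once and reused for every user, which is the source of the scalability advantage noted above.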

Computing the pearson similarity matrix...
Done computing similarity matrix.
1535 4291 4439 4470 4740
Likes Recommends Likes Recommends Likes Recommends Likes Recommends Likes Recommends
Top1 soft batch chocolate chip cookies catherine s excellent yorkshire pudding almond joy fudge brownies catherine s excellent yorkshire pudding simple apple crisp catherine s excellent yorkshire pudding apple raisin sauce for baked ham catherine s excellent yorkshire pudding 5 minute toasted almond cheesecake pie catherine s excellent yorkshire pudding
Top2 easy blueberry lemon parfait caramel apple milkshakes soft and chewy m m cookies caramel apple milkshakes rice krispies bundt cake caramel apple milkshakes papaya yogurt boats caramel apple milkshakes real honest to goodness indoor s mores really caramel apple milkshakes
Top3 frog cupcakes french pecan pie doctored up vanilla pudding casserole french pecan pie toffee almond blondies french pecan pie peppered strawberries french pecan pie sopaipilla cheesecake french pecan pie
Top4 creamsicle float mashed sweet potato pie kittencal s bakery buttercream frosting icing mashed sweet potato pie ms lamorte s apple crisp mashed sweet potato pie fruit and fiber parfait ww friendly 1 point mashed sweet potato pie burger king s hershey s sundae pie mashed sweet potato pie
Top5 perfect sugar cookie glaze banana orange ice cream grandma may s molasses cookies banana orange ice cream fruitcake cookies made easy banana orange ice cream non alcoholic hard sauce banana orange ice cream unknownchef86 s instant pudding ice cream banana orange ice cream

Table 19. Overview of Likes and Recommendation for Item-Based Collaborative Filtering

Evaluating the model with MSE and RMSE

MSE:  0.8885198523637632
RMSE:  0.9426133100926186

Evaluating the Model Prediction Coverage

Coverage:  1.45

Latent Factor-based Collaborative Filtering

The SVD algorithm is a latent factor-based collaborative filtering algorithm that uses matrix factorization to learn latent factors that capture user preferences and item characteristics. It can handle large datasets and sparse data well and can be more accurate than k-NN methods in many cases. However, it can be slower to train and can be more difficult to interpret.
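The idea behind this family of models can be sketched with a tiny Funk-SVD-style factorization: learn low-dimensional user and item factor vectors by stochastic gradient descent on the observed ratings only. The interaction triples and hyperparameters here are illustrative, not the notebook's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed (user, item, rating) triples from a toy interaction log.
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 1.0), (2, 2, 5.0)]
n_users, n_items, n_factors = 3, 3, 2

P = rng.normal(0, 0.1, (n_users, n_factors))   # user latent factors
Q = rng.normal(0, 0.1, (n_items, n_factors))   # item latent factors
mu = np.mean([r for _, _, r in ratings])       # global mean baseline

lr, reg = 0.02, 0.02
for _ in range(500):                           # SGD over the observed ratings
    for u, i, r in ratings:
        err = r - (mu + P[u] @ Q[i])           # prediction error
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

# Score an unseen (user, item) pair with the learned factors.
pred = mu + P[0] @ Q[2]
```

Because the model only touches observed entries and stores two small factor matrices, it copes well with sparsity and scale; the price is that the learned factors are hard to interpret directly.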

1535 4291 4439 4470 4740
Likes Recommends Likes Recommends Likes Recommends Likes Recommends Likes Recommends
Top1 soft batch chocolate chip cookies kittencal s easy creamy white glaze almond joy fudge brownies berry sherbet surprise simple apple crisp oatmeal chocolate chip cookies ii apple raisin sauce for baked ham apple filling for pies 5 minute toasted almond cheesecake pie pumpkin bundt cake ii
Top2 easy blueberry lemon parfait strawberries stuffed with cranberry mascarpone soft and chewy m m cookies apple or pear crisp for one rice krispies bundt cake easy cake mix cookies papaya yogurt boats apple or pear crisp for one real honest to goodness indoor s mores really christmas rum balls or bourbon balls
Top3 frog cupcakes syrniki with potato doctored up vanilla pudding casserole julie s chocolate chip cookies toffee almond blondies coconut creme fudge peppered strawberries christmas rum balls or bourbon balls sopaipilla cheesecake reeses squares 5 ingredients no bake reese s
Top4 creamsicle float hershey s chocolate crumb crust kittencal s bakery buttercream frosting icing chocolate orange bundt cake ms lamorte s apple crisp individual mini cherry cheesecakes fruit and fiber parfait ww friendly 1 point blueberry filling for pies burger king s hershey s sundae pie brown sugar brownies blondies
Top5 perfect sugar cookie glaze avocado and berry pudding raw food grandma may s molasses cookies blueberry cake fruitcake cookies made easy amish country strawberry pie non alcoholic hard sauce french pear pudding unknownchef86 s instant pudding ice cream get up go bars

Table 20. Overview of Likes and Recommendation for Latent Factor-Based Collaborative Filtering

Evaluating the model with MSE and RMSE

MSE:  0.08926985933597519
RMSE:  0.29878062075036793

Evaluating the Model Prediction Coverage

Coverage:  41.42

Comparison of Evaluation Metrics

Now that we have finished building the three recommender systems, how do we compare them to each other?

There are debates about how to evaluate a recommender system or what KPIs should be employed to make a good comparison. Recommender systems can be evaluated in many ways using several metrics where each metric group has its purpose. However, for this study, we will be comparing our recommender systems using the following metrics:

  1. Errors (MSE and RMSE)
  2. Coverage
  3. Novelty
  4. Personalization
  5. Intra-list Similarity

Error Scores

Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) are used to evaluate the accuracy of predicted values, such as ratings, compared to the true values y. These can also be used to evaluate the reconstruction of a rating matrix. [13]

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}$$
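Both formulas are a couple of lines of NumPy; the rating values below are made up for illustration:

```python
import numpy as np

y_true = np.array([5.0, 4.0, 3.0])   # actual ratings
y_pred = np.array([4.5, 4.0, 2.0])   # model predictions

mse = np.mean((y_true - y_pred) ** 2)  # mean of squared errors
rmse = np.sqrt(mse)                    # same units as the ratings
```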

Figure 30. Bar Plot Comparing Error Scores between Different Methods of Recommender System

Coverage Scores

Coverage is the percent of items in the training data the model can recommend on a test set.[9][11]

$$\mathrm{coverage} = \frac{I}{N} \times 100$$

Where I is the number of unique items the model recommends in the test data, and N is the total number of unique items in the training data. The catalog coverage is the rate of distinct items recommended over a period of time to the user; for this purpose, the catalog coverage function also takes a parameter k, the number of observed recommendation lists. In essence, both metrics quantify the proportion of items that the system can work with. [13]

To better understand this, say we have 100 items in our training data, and we set the number of recommendations=10 where the recommendations are always the same set of items for all users. Then this would mean that the coverage is only 10% since it only ever recommends the same 10 items from the original data. In math terms, it is the union of all user recommendations over the total number of items.
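That example maps directly onto a few lines of Python; the catalog size and recommendation lists are made up:

```python
# Toy check of the coverage formula: the union of everything the model
# recommends, divided by the number of unique items in the training catalog.
catalog = {f"recipe_{i}" for i in range(100)}          # N = 100 items
recs = {
    "user_a": ["recipe_1", "recipe_2", "recipe_3"],
    "user_b": ["recipe_1", "recipe_2", "recipe_4"],
}

recommended = set().union(*recs.values())              # I = 4 distinct items
coverage = len(recommended) / len(catalog) * 100
print(coverage)
```

Only 4 distinct recipes are ever recommended out of 100, so the coverage is 4%.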

Here, the user-based collaborative filtering has 5.38% coverage, while the item-based collaborative filtering is only able to recommend 1.45% of the items it was trained on. The recommender system with the highest coverage is the latent factor-based collaborative filtering, with 41.42% coverage.

Figure 31. Bar Plot Comparing Coverage Scores between Different Methods of Recommender System

Novelty Scores

Novelty measures the capacity of a recommender system to propose novel and unexpected items that a user is unlikely to already know about. It uses the self-information of the recommended items: it calculates the mean self-information per top-N recommendation list and averages this over all users. [13]

$$\mathrm{novelty} = \frac{1}{|U|} \sum_{u \in U} \frac{\sum_{i \in \mathrm{top}N} -\log_2\left(\frac{\mathrm{count}(i)}{|U|}\right)}{|N|}$$

Where |U| is the number of users, count(i) is the number of users who consumed the specific item, and |N| is the length of the recommended list. [13]

The novelty metric in recommender systems measures the degree to which recommended items are new and unexpected to the user. It is a way of quantifying how much the recommendations can introduce the user to new items that they have not previously encountered or considered. [8][10]

The novelty metric is important in recommender systems because it can help to prevent the problem of "filter bubbles," where users are only recommended items that align with their existing preferences and interests. By introducing users to new and unexpected items, recommender systems can help to broaden their horizons and expose them to new ideas and experiences. [8][10]
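A minimal sketch of the formula above, using made-up interaction and recommendation lists: an item everyone has consumed carries zero self-information, while a rarely consumed item is highly novel.

```python
import math

def novelty(rec_lists, interactions):
    """Mean self-information -log2(count(i)/|U|) of the recommended items,
    averaged over each top-N list and then over all users."""
    n_users = len(interactions)
    count = {}
    for items in interactions.values():
        for i in items:
            count[i] = count.get(i, 0) + 1
    per_user = [
        sum(-math.log2(count[i] / n_users) for i in items) / len(items)
        for items in rec_lists.values()
    ]
    return sum(per_user) / len(per_user)

# "a" was consumed by all 4 users (no surprise); "b" by only one (novel).
interactions = {"u1": ["a"], "u2": ["a"], "u3": ["a"], "u4": ["a", "b"]}
recs = {"u1": ["b"], "u2": ["a"]}
print(novelty(recs, interactions))
```

Recommending "b" contributes -log2(1/4) = 2 bits of surprise, while "a" contributes 0, so the two lists average out to a novelty of 1.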

Figure 32. Bar Plot Comparing Novelty Scores between Different Methods of Recommender System

Personalization Scores

Personalization is a great way to assess whether a model recommends many of the same items to different users. It is the dissimilarity (1 − cosine similarity) between different users' lists of recommendations. An example best illustrates how personalization is calculated. [9]

image.png

A high personalization score indicates that different users' recommendation lists are different, meaning the model is offering a personalized experience to each user. [9]

Summary:

  • A high score indicates good personalization (users' lists of recommendations are different).
  • A low score indicates poor personalization (users' lists of recommendations are very similar).
  • A model is "personalizing" well if the set of recommendations for each user is different. [12]
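The calculation reduces to cosine similarity between users' binary recommendation vectors. A toy sketch with made-up lists:

```python
import numpy as np

def personalization(rec_lists):
    """1 minus the mean pairwise cosine similarity between users'
    binary recommendation vectors."""
    items = sorted({i for lst in rec_lists for i in lst})
    idx = {it: j for j, it in enumerate(items)}
    M = np.zeros((len(rec_lists), len(items)))   # users x items indicator matrix
    for r, lst in enumerate(rec_lists):
        for it in lst:
            M[r, idx[it]] = 1.0
    sims = [
        M[a] @ M[b] / (np.linalg.norm(M[a]) * np.linalg.norm(M[b]))
        for a in range(len(M)) for b in range(a + 1, len(M))
    ]
    return 1.0 - float(np.mean(sims))

# Identical lists -> no personalization; disjoint lists -> full personalization.
print(personalization([["a", "b"], ["a", "b"]]))
print(personalization([["a", "b"], ["c", "d"]]))
```

Two users given identical lists score 0, while fully disjoint lists score 1.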

Figure 33. Bar Plot Comparing Personalization Scores between Different Methods of Recommender System

Intra-list Similarity Scores

The intra-list similarity is the average cosine similarity of all items in a list of recommendations. This calculation uses features of the recommended items (such as movie genre) to calculate the similarity. This calculation is also best illustrated with an example. [9]

image.png

If a recommender system is recommending lists of very similar items to single users (for example, a user receives only recommendations of romance movies), then the intra-list similarity will be high. [9]
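A toy version of the calculation, with hypothetical one-hot category vectors standing in for the recipe features:

```python
import numpy as np

def intra_list_similarity(rec_list, features):
    """Average pairwise cosine similarity between the feature vectors
    (e.g. one-hot dessert categories) of the items in one list."""
    vecs = [np.asarray(features[i], dtype=float) for i in rec_list]
    sims = [
        vecs[a] @ vecs[b] / (np.linalg.norm(vecs[a]) * np.linalg.norm(vecs[b]))
        for a in range(len(vecs)) for b in range(a + 1, len(vecs))
    ]
    return float(np.mean(sims))

# Hypothetical one-hot categories: [cookie, pie, cake].
features = {"cookie_1": [1, 0, 0], "cookie_2": [1, 0, 0], "pie_1": [0, 1, 0]}
print(intra_list_similarity(["cookie_1", "cookie_2"], features))  # similar list
print(intra_list_similarity(["cookie_1", "pie_1"], features))     # diverse list
```

A list of two cookies scores 1 (maximally similar), while a cookie paired with a pie scores 0.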

Figure 34. Bar Plot Comparing Intra-list Similarity Scores between Different Methods of Recommender System

Metrics Radar Plot Summary

Now that we have all the scores for each recommender system, let us compare them using a radar plot. Here, we compare the personalization, coverage, and intra-list similarity. The reason why novelty and error scores are not part of this radar plot is that they operate under a different scale and must be visualized differently. Regardless, let us proceed with the comparison using the radar plot.

How do we know that one model is outperforming the others? In a radar plot, the model that has the largest area is considered to be the best performing.

In this case, the latent-factor-based model outperforms the other two by quite a huge margin, especially in personalization and coverage.

radar1.png

Figure 35. Radar Plot Comparing Scores between Different Methods of Recommender System

Metrics Table Summary

Let us put all the values in a single table so we can compare the numbers. Let us summarize these metrics and put them into context:

  1. MSE and RMSE - Lower is better. This measures the accuracy of the recommendations made by the system.
  2. Coverage - Higher is better. This measures the proportion of items in the catalog that are recommended by the system. A high coverage indicates that the system can recommend a large number of items.
  3. Novelty - Contextual. This measures the novelty of the recommendations made by the system. A high novelty indicates that the system recommends items that the user has not seen before. Users however may prefer a more balanced novelty.
  4. Personalization - Higher is better. This measures how personalized the recommendations are for each user. A high personalization metric indicates that the recommendations are tailored to the individual user's preferences and needs, rather than being generic or popular recommendations.
  5. Intra-list Similarity - Higher is better in this context. This measures the similarity between the items in a list of recommendations. A low intra-list similarity indicates that the recommendations are diverse and cover a wide range of items, while a high intra-list similarity indicates that the recommendations are similar to each other and may not offer much variety, which is what we want for this use case.
User-Based Item-Based Latent Factor-Based
MSE 0.297901 0.888520 0.089270
RMSE 0.545804 0.942613 0.298781
Coverage 5.380000 1.450000 41.420000
Novelty 9.135441 9.343042 7.735434
Personalization 0.312501 0.218772 0.972720
Intra-list Similarity 0.921288 0.938223 0.940725

Table 21. Summary of Metrics between the Different Methods of Recommender System

From these values, it is apparent that the latent-factor-based model heavily outperforms the other models. Moving forward, we will be using the latent-factor-based model (SVD) to conduct cluster-specific recommendations.

Recommender System per Cluster of Users

In the next sections we repeat this exercise, but this time we incorporate clustering into the recommender system. First, we tackle user-cluster-specific recommendations. What does this mean?

Users are first grouped into clusters based on their preferences or behaviors. Recommendations are then generated for each user based on the preferences of other users in the same cluster. This approach can help overcome the cold-start problem, in which new users or items have too little interaction data to recommend from.
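The two-step idea can be sketched as follows. For brevity, a toy utility matrix stands in for the real data, and cluster-mean scoring stands in for the SVD model used in the project; the KMeans setup is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy utility matrix: 20 users x 10 desserts, ratings 1-5, 0 = unrated
R = rng.integers(1, 6, size=(20, 10)).astype(float)
R[rng.random(R.shape) < 0.3] = 0.0

# Step 1: cluster users by their rating vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(R)

# Step 2: within a user's cluster, rank unseen items by the cluster's mean rating
def recommend(user, top_n=3):
    peers = R[labels == labels[user]]
    item_scores = peers.mean(axis=0)
    unseen = np.where(R[user] == 0)[0]
    return unseen[np.argsort(item_scores[unseen])[::-1][:top_n]]
```

Because each user is scored only against peers in the same cluster, a brand-new user can receive sensible recommendations as soon as they can be assigned to a cluster.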

User-cluster-specific recommendation for user_cluster 0

These are the recommendations for User Cluster 0.

804851 804550 723491 162826 294435
Likes Recommends Likes Recommends Likes Recommends Likes Recommends Likes Recommends
Top1 catherine s excellent yorkshire pudding easy cake mix cookies caramel apple milkshakes catherine s excellent yorkshire pudding french pecan pie lemon sour cream pie mashed sweet potato pie cake mix chocolate cookies banana orange ice cream layer cookies magic layer bars
Top2 hidden chocolate mint cookies grandma s soft sugar cookies warm cocoa coffee apple pie cake never fail chocolate cake easy chocolate dipping sauce those pretzel things pecan rollo bites quick n easy fruit dip sour cherry pie black coffee chocolate cake
Top3 champorado chocolate rice pudding christmas rum balls or bourbon balls chocolate panini low fat fudge bars rich creamy chocolate peanut butter milk shake black coffee chocolate cake chocolate mocha pudding low carb black coffee chocolate cake microwave peanut brittle christmas rum balls or bourbon balls
Top4 aunt gerry s peanut butter fudge creamsicle ice cream mocha delight healthy pie crust kittencal s bakery buttercream frosting icing easy frosting aunt tootsie s lemon cake chewy macaroons meringue cookies or cloud cookies orange chocolate ice cream sauce
Top5 linda s frozen fruit sicles individual mini cherry cheesecakes giggly cow creme heloise s cake mix cookies strawberry rhubarb streusel pie christmas rum balls or bourbon balls southern style lemon chess pie filling christmas rum balls or bourbon balls chilled fruit soup vanilla almond rice krispies treats

Table 22. Overview of Likes and Recommendation Specific to Cluster 0 of User Profile

MSE:  0.08857668355737312
RMSE:  0.29761835218509813
Coverage:  42.66

User-cluster-specific recommendation for user_cluster 1

These are the recommendations for User Cluster 1

324621 198154 255179 8629 136511
Likes Recommends Likes Recommends Likes Recommends Likes Recommends Likes Recommends
Top1 healthy pie crust pumpkin bundt cake ii peanut butter blossoms cappuccino rum shakes kittencal s chocolate frosting icing soft coconut macaroons ultimate chocolate chip bars ghost cookies fast and easy coconut custard pie sugar cookie icing
Top2 raspberry sherbet punch coconut condensed milk cake pumpkin cupcakes chocolate dipped krispies peanut butter balls amish brownies peanut butter chocolate chunk cookies easy cherry pie filling crisp easy chocolate shop microwave chocolate fudge deep dish apple pie with its own crust chocolate covered strawberries
Top3 paige s buttercream frosting cranberry bars diet coke cake chocolate icing frosting cake decorators frosting baileys irish cream truffles chocolate glaze or frosting german chocolate cake icing pink stuff cherry pie filling pineapple d... coconut condensed milk cake
Top4 caramel cake frosting icing no bake vanilla orange balls chocolate yogurt melts butterscotch treats chocolate frosting hg s quickie caramel custard ww points 2 butter cream icing buttercream frosting graham cracker crust chips ahoy dessert nif s peanut butter banana muffins
Top5 cake decorating icing zucchini chocolate cake 4 points diet soda cake ila s apple crisp heloise s cake mix cookies amish boiled cookies 2 ww points single crepe linda s apple brownies bisquick chocolate chip bars vegan chocolate pudding

Table 23. Overview of Likes and Recommendation Specific to Cluster 1 of User Profile

MSE:  0.09002211436803476
RMSE:  0.30003685501623756
Coverage:  52.97

Comparison of Evaluation Metrics

These are the evaluation results for each set of recommendations made, for user cluster 0 and user cluster 1.

Error Scores

Figure 36. Bar Plot Comparing Error Scores between Recommender System Specific to User Profile Clusters

Coverage Scores

Figure 37. Bar Plot Comparing Coverage Scores between Recommender System Specific to User Profile Clusters

Novelty Scores

Figure 38. Bar Plot Comparing Novelty Scores between Recommender System Specific to User Profile Clusters

Personalization Scores

Figure 39. Bar Plot Comparing Personalization Scores between Recommender System Specific to User Profile Clusters

Intra-list Similarity Scores

Figure 40. Bar Plot Comparing Intra-list Similarity Scores between Recommender System Specific to User Profile Clusters

Metrics Radar Plot Summary

radar_user.png

Figure 41. Radar Plot Comparing Scores between Recommender System Specific to User Profile Clusters

Recommender System per Cluster of Items

Earlier, we did user-cluster-specific recommendations. This time, we will do item-cluster-specific recommendations. What does this mean?

Clustering item profiles groups similar items together based on their characteristics and attributes. This helps improve the accuracy and efficiency of the recommendation process by reducing the amount of computation needed to find relevant items for a user.

By grouping similar items, the system can make recommendations that are more tailored to a user's preferences, since items within a cluster are likely to have similar appeal. Moreover, clustering item profiles can help to address the problem of cold start.

Item-cluster-specific recommendation for item_cluster 0

These are the recommendations for item cluster 0.

126440 104295 176615 494084 56251
Likes Recommends Likes Recommends Likes Recommends Likes Recommends Likes Recommends
Top1 microwave scalloped apples banana orange ice cream no bake honey oat peanut butter bars mocha delight lemon sour cream pie lemon jello cake door county cherry dessert those pretzel things pecan rollo bites ultimate chocolate chip bars quick n easy fruit dip
Top2 christmas rum balls or bourbon balls chocolate coffee ice cream soda molasses oatmeal cookies frozen chocolate chip cookie dough balls almost apple pie chocolate lovers favorite cake healthy for them yogurt popsicles easy pecan bars lemon pineapple can can dessert white chocolate covered oreos
Top3 austrian strawberry torte coffee punch with ice cream floats banana tahini malted diabetic maple cheesecake tapioca pudding using minute tapioca easy cake mix cookies decorator buttercream icing best zucchini brownies ever sweet lite fluffy stuff nestle toll house walnut pie aka black cat pie
Top4 chocolate chip cannoli filling w kahlua cre... heloise s cake mix cookies chocolate madeleines sauteed strawberries with a twist kittencal s easy creamy white glaze fruit sorbet simple mint chocolate chip ice cream kittencal s easy creamy white glaze peanut butter no bake cookies creamy fruit salad
Top5 authentic no refrigeration bakery frosting icing pumpkin bundt cake ii weight watchers 1 point ice cream sandwich caramel filling the best oatmeal chocolate chip cookies special k bars graham cracker crust creamy brownie frosting pink stuff cherry pie filling pineapple d... strawberries with sour cream and brown sugar

Table 24. Overview of Likes and Recommendation Specific to Cluster 0 of Item Profile

MSE:  0.08758545524004101
RMSE:  0.2959483996240578
Coverage:  43.22

Item-cluster-specific recommendation for item_cluster 1

These are the recommendations for item cluster 1.

68526 453856 814629 204024 452940
Likes Recommends Likes Recommends Likes Recommends Likes Recommends Likes Recommends
Top1 wacky cake peanut butter cookies johnny cash s mother s ... wacky cake mija greek candy wacky cake pumpkin pie simply the best sand castle brownie mix jeanie s cake frosting cocada coconut candy ina gartens chocolate buttercream frosting
Top2 grape nut pudding blender chocolate pie fall pumpkin cake out of this world chocolate cookies sweet potato pie my eye my favorite chocolate chip cake pumpkin pie simply the best hurry up pie frosty peach pie supreme julie s chocolate chip cookies
Top3 None sarah s m m cookies None no bake cookies with a measure of love 10 minute german chocolate pie eggless milkless butterless spice cake 5 min ice cream lacy oatmeal crisp cookies 250 00 chocolate chip cookies apple and date loaf
Top4 None nubbly apple cake None paula deen s apple butter pumpkin pie None pig lickin cake 100 chocolate cake my favorite chocolate chip cake None easy apple cake
Top5 None butternut squash cake None chocolate chip bundt cake with chocolate glaze None peanut butter cookies johnny cash s mother s ... None perfect plum pie None chocolate frosting

Table 25. Overview of Likes and Recommendation Specific to Cluster 1 of Item Profile

MSE:  0.03239958567086679
RMSE:  0.1799988490820616
Coverage:  82.01

Comparison of Evaluation Metrics

These are the evaluation results for each set of recommendations made, for item cluster 0 and item cluster 1.

Error Score

Figure 42. Bar Plot Comparing Error Scores between Recommender System Specific to Item Profile Clusters

Coverage Score

Figure 43. Bar Plot Comparing Coverage Scores between Recommender System Specific to Item Profile Clusters

Novelty Score

Figure 44. Bar Plot Comparing Novelty Scores between Recommender System Specific to Item Profile Clusters

Personalization Score

Figure 45. Bar Plot Comparing Personalization Scores between Recommender System Specific to Item Profile Clusters

Intra-list Similarity Score

Figure 46. Bar Plot Comparing Intra-list Similarity Scores between Recommender System Specific to Item Profile Clusters

Metrics Radar Plot Summary

radar_item.png

Figure 47. Radar Plot Comparing Scores between Recommender System Specific to Item Profile Clusters

Based on User Preference per Cluster

We also make recommendations based on user preference. This is one variation of a combined implementation of clustering and collaborative filtering; we call it a variation because there are many ways to combine the two.

Why do we want this?

Combining clustering of both users and items with recommender systems can lead to better recommendations for users in several ways:

  • Improved accuracy: Clustering users and items can help identify patterns in user behavior and preferences, as well as in the characteristics of the items being recommended. By leveraging this information, recommender systems can make more accurate recommendations that are tailored to the specific needs and preferences of individual users.

  • Increased coverage: Clustering can help identify items that may be of interest to users who have not yet interacted with them. By grouping similar items, recommender systems can recommend items that users may not have otherwise discovered, expanding the overall coverage of the system.

  • Increased diversity: Recommender systems can suffer from the problem of over-specialization, where users are recommended the same types of items over and over again. By clustering items and users, recommender systems can identify diverse sets of recommendations that still meet users' needs and preferences.

  • Improved scalability: Clustering can help with the scalability of recommender systems by reducing the dimensionality of the data being processed. This can make it easier and faster to generate recommendations, especially when dealing with large datasets.

Recommendation for Users who Like Cluster 0 of Desserts

126440 104295 176615 494084 56251
Likes Recommends Likes Recommends Likes Recommends Likes Recommends Likes Recommends
Top1 microwave scalloped apples banana orange ice cream no bake honey oat peanut butter bars mocha mousse lemon sour cream pie sour cream raisin pie door county cherry dessert gluten free waffles ultimate chocolate chip bars quick n easy fruit dip
Top2 christmas rum balls or bourbon balls lemon jello cake molasses oatmeal cookies diabetic peanut butter balls almost apple pie quick n easy fruit dip healthy for them yogurt popsicles shortcake lemon pineapple can can dessert quick and easy no bake chocolate cookies
Top3 austrian strawberry torte raspberry sherbet punch banana tahini malted chocolate cream cheese frosting tapioca pudding using minute tapioca rhubarb pie ii decorator buttercream icing strawberry popsicle sweet lite fluffy stuff very rich hot buttered rum
Top4 chocolate chip cannoli filling w kahlua cre... dynamite oatmeal chocolate chip raisin cookies chocolate madeleines a jamaican goddess kittencal s easy creamy white glaze coconut condensed milk cake simple mint chocolate chip ice cream rolo prailine pretzel bites peanut butter no bake cookies microwave peanut brittle
Top5 authentic no refrigeration bakery frosting icing oatmeal chocolate chip cookies ii weight watchers 1 point ice cream sandwich sara s 5 cup salad the best oatmeal chocolate chip cookies apple and rhubarb crumble graham cracker crust girl scout chocolate mint cookies copycat pink stuff cherry pie filling pineapple d... peanut butter blossoms

Table 26. Overview of Likes and Recommendation of Users who Like Cluster 0 of Item Profile

MSE:  0.08293727504090397
RMSE:  0.28798832448712913
Coverage:  43.02

Recommendation for Users who Like Cluster 1 of Desserts

68526 453856 814629 204024 452940
Likes Recommends Likes Recommends Likes Recommends Likes Recommends Likes Recommends
Top1 wacky cake golden sugar cookies wacky cake black coffee chocolate cake wacky cake bourbon chocolate pecan pie sand castle brownie mix maraschino cherry loaf cocada coconut candy star anise ice cream
Top2 grape nut pudding quick and easy cranberry pie fall pumpkin cake poppy seed pound cake sweet potato pie my eye veronica s lemon buttercream frosting pumpkin pie simply the best lemon pudding filled angel food cake 3 ingred... frosty peach pie supreme mom s icing
Top3 None easy as microwave chocolate fudge None never fail fudge 10 minute german chocolate pie sand castle brownie mix 5 min ice cream butterscotch cake the best you ll ever make 250 00 chocolate chip cookies sarah s m m cookies
Top4 None chocolate peanut butter frosting 2 None sour cream pastry None pumpkin chocolate chip bundt cake 100 chocolate cake fudge icing None bisquick chocolate chip bars
Top5 None canadian brown sugar pie None perfect plum pie None quick and easy cranberry pie None berry berry cool pie None peppermint buttercream icing

Table 27. Overview of Likes and Recommendation of Users who Like Cluster 1 of Item Profile

MSE:  0.021317368238131826
RMSE:  0.14600468567183666
Coverage:  89.21

Comparison of Evaluation Metrics

Error Score

Figure 48. Bar Plot Comparing Error Scores between Recommender System Specific to Users who Like a Specific Item Cluster

Coverage Score

Figure 49. Bar Plot Comparing Coverage Scores between Recommender System Specific to Users who Like a Specific Item Cluster

Novelty Score

Figure 50. Bar Plot Comparing Novelty Scores between Recommender System Specific to Users who Like a Specific Item Cluster

Personalization Score

Figure 51. Bar Plot Comparing Personalization Scores between Recommender System Specific to Users who Like a Specific Item Cluster

Intra-list Similarity Score

Figure 52. Bar Plot Comparing Intra-list Similarity Scores between Recommender System Specific to Users who Like a Specific Item Cluster

Metrics Radar Plot Summary

radar_combo.png

Figure 53. Radar Plot Comparing Scores between Recommender System Specific to Users who Like a Specific Item Cluster

Recommendation for Clustered Users and Items

Another variation of a combined implementation of clustering and collaborative filtering is by changing the utility matrix to have rows as cluster labels of users and columns as cluster labels of items. This can be beneficial in several ways:

  • Reduced dimensionality: By clustering users and items, the number of rows and columns in the utility matrix is reduced, which can help with the scalability of the recommender system. This can be particularly useful when dealing with large datasets where the full utility matrix may be too large to store or process efficiently.

  • Improved accuracy: Clustering users and items can help to identify patterns in user behavior and preferences, as well as in the characteristics of the items being recommended. By grouping similar users and items, the recommender system can make more accurate recommendations that are tailored to the specific needs and preferences of each user cluster and item cluster.

  • Increased diversity: Clustering users and items can also help to increase the diversity of the recommendations being made. By identifying similar users and items within clusters, the recommender system can recommend a more diverse set of items to each user cluster.

  • Better handling of sparsity: When dealing with sparse data, clustering can help to identify latent relationships between users and items that may not be immediately obvious in the raw data. This can lead to better recommendations, even when there are only a few interactions between users and items.
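Building this cluster-level utility matrix is essentially a grouped aggregation. The sketch below shows one way to do it with a pandas `pivot_table`; the toy interaction records and column names (`cluster_user`, `cluster_item`) are hypothetical, assuming cluster labels have already been attached to each interaction.

```python
import pandas as pd

# Toy interactions; cluster labels would come from the earlier clustering step
ratings = pd.DataFrame({
    "user_id":      [1, 1, 2, 3, 3, 4],
    "recipe_id":    [10, 11, 10, 12, 11, 12],
    "rating":       [5, 4, 3, 5, 4, 2],
    "cluster_user": [0, 0, 0, 1, 1, 1],
    "cluster_item": [0, 1, 0, 1, 1, 1],
})

# Cluster-level utility matrix: mean rating of each user cluster for each
# item cluster (NaN where a cluster pair has no interactions)
cluster_matrix = ratings.pivot_table(index="cluster_user",
                                     columns="cluster_item",
                                     values="rating",
                                     aggfunc="mean")
```

The resulting matrix has one row per user cluster and one column per item cluster, which is exactly the reduced utility matrix described above.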

Out[92]:
cluster_item 0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
cluster_user
0 NaN NaN 4.000000 NaN NaN 5.000000 NaN 0.000000 NaN NaN ... NaN NaN NaN NaN 5.000000 NaN NaN 4.333333 NaN NaN
1 NaN 3.750000 4.000000 NaN 5.000000 5.000000 4.400000 NaN 5.000000 5.000000 ... 3.600000 NaN 4.666667 4.833333 NaN 4.000000 5.000000 5.000000 4.000000 5.000000
2 NaN NaN 4.125641 NaN NaN 5.000000 5.000000 5.000000 4.000000 4.500000 ... NaN 5.000000 NaN 4.800000 NaN 5.000000 5.000000 4.833333 4.000000 NaN
3 NaN 5.000000 4.928571 NaN NaN 5.000000 4.666667 4.750000 4.000000 5.000000 ... 4.600000 5.000000 NaN 4.857143 NaN 5.000000 5.000000 4.400000 4.208333 5.000000
4 NaN 5.000000 NaN NaN 4.500000 5.000000 3.666667 4.666667 4.666667 5.000000 ... NaN 5.000000 4.666667 5.000000 NaN NaN 5.000000 5.000000 4.400000 5.000000
5 NaN 5.000000 4.654762 5.0 4.750000 5.000000 4.500000 5.000000 4.500000 4.333333 ... 4.714286 5.000000 5.000000 4.125000 5.000000 4.500000 4.333333 2.500000 4.500000 5.000000
6 3.875000 4.428571 4.308761 NaN 4.110465 4.783333 4.472789 4.552083 4.380296 4.453125 ... 4.660377 4.686508 4.465116 4.421875 4.476190 4.526316 4.483871 4.575758 4.344697 4.534483
7 NaN 5.000000 4.000000 NaN 4.500000 5.000000 5.000000 5.000000 4.500000 4.000000 ... 5.000000 5.000000 NaN 5.000000 NaN NaN NaN 5.000000 NaN 5.000000
8 5.000000 4.666667 3.250000 3.0 5.000000 4.750000 4.666667 5.000000 NaN 3.750000 ... 4.333333 2.500000 3.333333 3.333333 5.000000 4.500000 4.800000 NaN 4.500000 4.800000
9 4.500000 4.972222 4.500000 5.0 4.500000 4.000000 4.750000 5.000000 4.500000 4.071429 ... 4.600000 5.000000 4.600000 4.571429 0.000000 5.000000 5.000000 4.750000 4.642857 NaN
10 NaN 4.500000 3.750000 NaN NaN 5.000000 5.000000 4.500000 3.000000 4.666667 ... 4.500000 NaN 4.750000 NaN NaN 4.666667 2.357143 2.500000 4.600000 NaN
11 4.166667 4.900000 4.500000 NaN 4.666667 4.571429 4.500000 2.800000 4.500000 4.555556 ... 4.666667 5.000000 4.250000 4.800000 5.000000 5.000000 4.666667 4.857143 4.636364 3.857143
12 5.000000 5.000000 4.666667 NaN 5.000000 NaN 5.000000 3.500000 5.000000 2.333333 ... NaN 5.000000 NaN 5.000000 5.000000 4.250000 5.000000 3.333333 4.600000 NaN
13 5.000000 4.666667 4.750000 NaN 5.000000 NaN 5.000000 5.000000 5.000000 3.600000 ... 5.000000 5.000000 3.500000 4.250000 5.000000 4.000000 4.750000 4.333333 4.250000 NaN
14 NaN 5.000000 4.375000 NaN 5.000000 5.000000 4.000000 5.000000 3.750000 5.000000 ... 4.373016 4.375000 4.500000 3.500000 5.000000 5.000000 4.777778 5.000000 3.000000 5.000000
15 NaN 4.666667 3.000000 NaN 4.000000 4.750000 4.785714 5.000000 5.000000 4.800000 ... 4.000000 5.000000 4.458333 4.375000 NaN 4.666667 5.000000 4.250000 3.500000 5.000000
16 3.500000 4.000000 4.861111 NaN 4.714286 5.000000 4.000000 4.666667 5.000000 3.750000 ... 5.000000 5.000000 5.000000 4.900000 5.000000 5.000000 5.000000 5.000000 4.366667 4.666667
17 NaN 4.666667 4.400000 NaN 3.666667 3.800000 4.300000 5.000000 3.500000 4.600000 ... 4.125000 4.200000 5.000000 4.500000 5.000000 3.666667 NaN 4.666667 4.000000 4.600000
18 NaN 5.000000 NaN NaN 5.000000 5.000000 4.800000 5.000000 5.000000 4.500000 ... 5.000000 5.000000 NaN NaN NaN 5.000000 NaN NaN 5.000000 5.000000
19 4.000000 4.620690 4.000000 NaN 4.375000 4.850000 4.734848 4.714286 4.777778 4.588235 ... 4.250000 4.187500 4.500000 4.333333 4.888889 4.571429 4.562500 4.333333 4.071429 3.625000
20 0.000000 3.800000 5.000000 5.0 3.400000 5.000000 1.000000 4.000000 5.000000 4.765625 ... 4.071429 4.750000 5.000000 4.142857 2.833333 4.166667 4.600000 5.000000 4.333333 5.000000
21 3.000000 4.571429 4.400000 NaN 4.750000 4.400000 4.000000 4.000000 5.000000 4.750000 ... 5.000000 4.333333 4.333333 5.000000 3.333333 3.800000 1.000000 5.000000 4.857143 5.000000
22 NaN NaN 5.000000 NaN 5.000000 NaN NaN 4.500000 NaN 5.000000 ... NaN NaN NaN 4.687500 NaN NaN NaN 5.000000 NaN 4.000000
23 NaN 3.500000 5.000000 5.0 4.333333 4.500000 5.000000 5.000000 5.000000 4.600000 ... 5.000000 5.000000 5.000000 4.500000 4.666667 5.000000 NaN 3.000000 4.666667 4.666667
24 NaN 5.000000 4.437500 NaN NaN 5.000000 4.000000 5.000000 3.000000 4.933333 ... 5.000000 5.000000 NaN 2.750000 NaN 5.000000 NaN 5.000000 5.000000 5.000000
25 NaN 4.833333 4.500000 5.0 4.000000 4.800000 5.000000 4.666667 3.000000 4.600000 ... 4.500000 5.000000 5.000000 4.800000 NaN 4.750000 4.500000 5.000000 4.333333 5.000000
26 NaN NaN 5.000000 NaN NaN 5.000000 NaN NaN NaN NaN ... 5.000000 NaN NaN NaN NaN NaN 4.000000 5.000000 4.666667 NaN
27 NaN 5.000000 5.000000 NaN 4.000000 5.000000 3.500000 5.000000 0.000000 4.500000 ... 4.000000 3.333333 4.750000 5.000000 5.000000 5.000000 4.000000 4.000000 5.000000 4.500000
28 NaN 5.000000 4.666667 3.0 5.000000 1.000000 4.500000 5.000000 NaN 4.200000 ... NaN 4.000000 NaN 3.750000 NaN 5.000000 5.000000 4.000000 5.000000 5.000000
29 5.000000 4.000000 3.783333 NaN 5.000000 NaN 4.000000 4.600000 2.000000 5.000000 ... 3.000000 4.666667 5.000000 4.166667 NaN 5.000000 5.000000 5.000000 3.666667 5.000000
30 NaN 5.000000 5.000000 5.0 NaN 5.000000 4.750000 4.250000 4.571429 NaN ... 5.000000 4.700000 5.000000 5.000000 4.333333 2.000000 5.000000 2.500000 5.000000 5.000000
31 NaN 5.000000 4.988636 5.0 4.500000 3.333333 4.500000 5.000000 5.000000 4.800000 ... 4.916667 5.000000 4.666667 4.250000 3.000000 5.000000 4.975000 4.166667 4.800000 4.000000
32 NaN 5.000000 5.000000 NaN 5.000000 NaN 5.000000 5.000000 NaN 4.500000 ... NaN 5.000000 NaN NaN 4.000000 NaN 5.000000 4.000000 NaN 5.000000
33 5.000000 5.000000 5.000000 NaN 5.000000 5.000000 5.000000 5.000000 4.666667 NaN ... 5.000000 5.000000 5.000000 4.666667 5.000000 5.000000 5.000000 NaN 5.000000 4.600000
34 5.000000 NaN 4.250000 NaN 5.000000 5.000000 5.000000 4.750000 5.000000 4.833333 ... 5.000000 5.000000 2.333333 5.000000 NaN 5.000000 5.000000 NaN 4.500000 2.666667
35 NaN 5.000000 4.200000 NaN 4.000000 2.600000 5.000000 3.333333 4.500000 4.000000 ... 5.000000 NaN 3.250000 5.000000 5.000000 4.900000 5.000000 4.611111 4.166667 3.333333
36 5.000000 3.833333 5.000000 NaN 2.500000 3.800000 4.333333 3.750000 4.833333 5.000000 ... 4.500000 5.000000 3.666667 4.600000 5.000000 2.666667 5.000000 5.000000 4.428571 4.750000
37 NaN 3.000000 NaN NaN 4.750000 3.800000 3.000000 4.333333 5.000000 4.666667 ... 5.000000 4.000000 5.000000 5.000000 4.500000 4.666667 4.500000 3.000000 4.333333 NaN
38 NaN 4.884615 4.250000 4.0 4.000000 4.812500 4.733333 4.750000 4.714286 4.333333 ... 5.000000 4.363636 4.833333 4.900000 5.000000 4.900000 4.500000 4.142857 4.150000 4.535714
39 NaN 4.833333 5.000000 NaN 5.000000 5.000000 4.500000 4.666667 NaN 4.428571 ... 5.000000 3.333333 5.000000 4.600000 4.333333 3.750000 4.250000 NaN 5.000000 4.000000
40 5.000000 4.666667 4.300000 NaN 3.875000 4.600000 5.000000 5.000000 4.400000 4.296296 ... 4.250000 5.000000 5.000000 5.000000 4.333333 4.800000 4.000000 3.444444 4.307692 4.500000
41 NaN 3.800000 4.500000 5.0 4.700000 4.454545 4.875000 4.750000 4.500000 4.750000 ... 5.000000 4.500000 4.000000 4.285714 3.000000 4.600000 4.000000 4.571429 4.727273 5.000000
42 4.000000 5.000000 4.250000 5.0 4.500000 4.750000 5.000000 5.000000 4.666667 5.000000 ... 5.000000 NaN 5.000000 4.500000 5.000000 4.750000 3.333333 5.000000 5.000000 4.937500
43 NaN 5.000000 4.962963 NaN 4.500000 5.000000 5.000000 3.500000 4.150000 4.750000 ... 5.000000 5.000000 4.750000 5.000000 5.000000 4.800000 3.333333 5.000000 4.750000 4.400000
44 5.000000 4.931818 4.660494 5.0 4.821429 4.809524 4.850000 4.909091 4.869822 4.926667 ... 4.666667 4.342857 4.923077 4.890000 4.909091 4.796000 4.928571 4.777778 4.596939 4.882353
45 NaN NaN 4.500000 NaN NaN NaN NaN NaN NaN 5.000000 ... 2.000000 5.000000 NaN 4.000000 NaN 5.000000 NaN NaN NaN NaN
46 NaN NaN NaN NaN 4.666667 5.000000 4.250000 5.000000 NaN 4.555556 ... 4.500000 5.000000 NaN 5.000000 4.000000 5.000000 NaN 5.000000 5.000000 NaN
47 5.000000 4.500000 5.000000 NaN 4.250000 4.916667 4.571429 4.500000 3.900000 4.625000 ... 4.833333 4.428571 4.500000 4.750000 5.000000 3.750000 5.000000 3.666667 3.833333 4.500000
48 NaN 5.000000 3.833333 NaN 4.000000 NaN 4.500000 5.000000 4.000000 4.714286 ... 4.750000 5.000000 5.000000 5.000000 2.500000 3.333333 4.875000 NaN 5.000000 5.000000
49 NaN 4.433333 4.750000 NaN NaN 5.000000 4.500000 4.000000 5.000000 3.500000 ... 3.500000 5.000000 4.000000 4.333333 NaN 5.000000 5.000000 4.000000 4.760000 5.000000

50 rows × 50 columns

Table 28. Overview of Utility Matrix Clustering Both User and Item Profiles

Out[93]:
0 1 2 3 4 ... 45 46 47 48 49
Likes Recommends Likes Recommends Likes Recommends Likes Recommends Likes Recommends ... Likes Recommends Likes Recommends Likes Recommends Likes Recommends Likes Recommends
Top1 27 17 48 17 20 17 5 27 5 30 ... 26 22 47 5 2.0 17.0 5 20 22 17
Top2 13 44 45 18 11 5 5 21 17 28 ... 39 6 22 17 1.0 18.0 15 1 19 44
Top3 27 4 25 38 43 32 45 39 46 13 ... 38 17 26 18 45.0 5.0 47 30 45 18
Top4 35 5 11 4 30 44 43 18 45 18 ... 35 45 30 45 42.0 NaN 40 47 43 30
Top5 42 28 28 6 28 29 38 19 40 27 ... 34 13 32 47 25.0 NaN 39 27 38 21

5 rows × 98 columns

Table 29. Overview of Likes and Recommendation for a Recommender System with Clustering of Both User and Item Profiles

Results and Discussions

Why is dimensionality reduction necessary?

Dimensionality reduction is necessary for several reasons:

  1. Computational Efficiency - By reducing the number of dimensions, the computational cost can be reduced significantly.
  2. Noise Reduction - High-dimensional data often contains noise points and redundant features that affect the quality of clustering.
  3. Curse of Dimensionality - As the number of dimensions increases, it becomes more difficult to find meaningful patterns. Dimensionality reduction helps alleviate this.
  4. Memory Constraints - It requires a large amount of memory to store and process high-dimensional data. Reducing the number of dimensions makes it easier to work around the memory requirements.
  5. Data Visualization - It is difficult to visualize data in high dimensions.

Explain the rationale behind the choice of dimensionality reduction technique.

There are several dimensionality reduction techniques that can be applied to the data; below are two examples:

  1. Principal Component Analysis (PCA)
  2. Singular Value Decomposition (SVD)

However, the choice ultimately came down to SVD. SVD generally handles sparse data better than PCA: it can operate on matrices with many missing or zero values and can reduce dimensionality without losing important information, which makes it the more flexible and robust method for this kind of data.

Since we are working with a utility matrix of food ratings, which is sparse, SVD is a natural fit.

How many singular vectors were retained?

Below is the summary of the number of singular vectors retained for each dataset.

User Rating Matrix:

  • 539 Singular Vectors retained
  • 80% Cumulative Variance Explained

Item Profile Matrix:

  • 31 Singular Vectors retained
  • 80% Cumulative Variance Explained
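The retention rule can be sketched as follows: fit a (nearly) full-rank truncated SVD, then keep the smallest number of singular vectors whose cumulative explained variance reaches the 80% threshold. Random data and scikit-learn's `TruncatedSVD` stand in here for our actual matrices and pipeline.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)
X = rng.random((200, 60))  # stand-in for the sparse rating/profile matrix

# Fit with (nearly) full rank, then find the smallest k whose cumulative
# explained variance crosses 80%
svd = TruncatedSVD(n_components=59, random_state=0).fit(X)
cumvar = np.cumsum(svd.explained_variance_ratio_)
k = int(np.searchsorted(cumvar, 0.80) + 1)
```

Applying the same rule to the real matrices yielded the counts above (539 and 31 singular vectors, respectively).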

What features were used for the item profile matrix?

For the item profile matrix, we used the following features:

  • minutes - Minutes to prepare the recipe
  • n_steps - Number of steps in the recipe
  • nutrition - Nutrition information
  • ingredients - List of ingredient names

The ingredients list was vectorized to represent its words in a numerical format that the models can work with.
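A minimal sketch of this vectorization, using a bag-of-words `CountVectorizer` (assumed for illustration; a TF-IDF vectorizer would work similarly):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each recipe's ingredient list is joined into one string, then turned
# into a recipes x vocabulary matrix of token counts
ingredients = [
    ["flour", "sugar", "eggs", "chocolate chips"],
    ["flour", "sugar", "butter"],
]
docs = [" ".join(ing) for ing in ingredients]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
```

The resulting sparse matrix can then be concatenated with the numeric features (minutes, n_steps, nutrition) to form the item profile matrix.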

What clustering algorithms/techniques were used?

For clustering, we decided to evaluate the following techniques. (Note: only three clustering techniques are shown in the main body; the others have been moved to the appendices.)

1. Representative-based Clustering

  • K-means
  • K-medoids

2. Hierarchical Clustering

  • Single Linkage
  • Average Linkage
  • Complete Linkage
  • Ward's Linkage

3. Density-based Clustering

  • Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
  • Ordering points to identify the clustering structure (OPTICS)

4. Probabilistic Clustering

  • Gaussian Mixture

Explain the rationale behind the choice of the clustering algorithm.

For our clustering technique, we ultimately decided to use K-means clustering as it provides the following advantages:

  1. Highly interpretable
  2. Efficiency (Fast run time)
  3. Simple
  4. Scalable
  5. Robust

How did we choose the optimal number of clusters k?

We performed hyperparameter tuning using a simple grid search, clustering over a range of k values and scoring each result to find the best number of clusters.

Note that in selecting the optimal number of clusters, we rely on internal validation scores rather than plot visualization. A plot conveys only a relatively small amount of information, so judging clusters visually can lead to poor conclusions.

Here is a summary of the scores:

  1. SSE - We want to look for the elbow. Inconclusive, all scores are approximately the same.
  2. Calinski-Harabasz Index - We want the highest score. Inconclusive, all scores are approximately the same.
  3. Silhouette Coefficient - We want a score closest to 0.5. Inconclusive, all scores are approximately the same.
  4. Davies-Bouldin Index - We want a score that is close to zero, but not zero. k=2 has a DB index closest to zero.

Using the DB index, we identified that our optimal number of clusters is k=2.
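The grid search can be sketched as follows. Toy blob data stands in for the SVD-reduced features, and the search range k = 2..7 is a hypothetical choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Toy two-group data standing in for the SVD-reduced features
X, _ = make_blobs(n_samples=200, centers=2, random_state=0)

scores = {}
for k in range(2, 8):  # hypothetical search range
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

best_k = min(scores, key=scores.get)  # lowest DB index wins
```

The other internal indices (SSE elbow, Calinski-Harabasz, silhouette) can be collected in the same loop and compared side by side.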

How did we label the clusters?

The clusters were labeled using a novel two-step method. A classifier is first trained to predict the cluster labels, then the Single Feature Introduction Test (SFIT) method is run on the model to identify the statistically significant features that characterize each cluster.

To perform classification on the clusters, we trained a 3-hidden-layer neural network with ReLU activation functions and hidden sizes of 100, 50, and 25. The network is trained for at most 50 epochs using the Adam optimizer.

After running the classification, we perform the Single Feature Introduction Test (SFIT) on the trained network by only using the data for a specific cluster, and returning a cluster's most important features.
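The two-step procedure can be sketched with scikit-learn. The architecture below matches the description (hidden sizes 100/50/25, ReLU, Adam, at most 50 epochs), but the SFIT probe is a simplified stand-in: it introduces one real feature at a time over an all-means baseline and ranks features by how much they move the predicted cluster probability, without the statistical significance testing of the full method. The data here is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: in the project, X holds recipe features, y cluster labels
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Architecture as described: hidden layers (100, 50, 25), ReLU, Adam, <=50 epochs
clf = MLPClassifier(hidden_layer_sizes=(100, 50, 25), activation="relu",
                    solver="adam", max_iter=50, random_state=0).fit(X, y)

def feature_importance(cluster):
    # Simplified SFIT-style probe: start from an all-means baseline and
    # introduce one feature at a time, measuring the shift in the
    # predicted probability of the target cluster
    Xc = X[y == cluster]
    baseline = np.tile(Xc.mean(axis=0), (len(Xc), 1))
    base_p = clf.predict_proba(baseline)[:, cluster].mean()
    deltas = []
    for j in range(X.shape[1]):
        probe = baseline.copy()
        probe[:, j] = Xc[:, j]  # introduce the single feature j
        deltas.append(abs(clf.predict_proba(probe)[:, cluster].mean() - base_p))
    return np.argsort(deltas)[::-1]  # feature indices, most impactful first
```

Running the probe per cluster yields a ranked feature list, from which the top five features reported below are taken.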

Here are the five most important features of each cluster:

Cluster 0

  • calories
  • sodium_pvd
  • sat_fat_pvd
  • carbs_pvd
  • protein_pvd

Cluster 1

  • calories
  • flour
  • sat_fat_pvd
  • eggs
  • chocolate chips

These features are what separate one cluster from the other. However, the results do not tell us exactly why these features are considered the most important. It does not tell us the feature's abundance or lack thereof.

Since we are comparing only two clusters, some significant features are likely to be shared by both. In our case, calories and saturated fat were significant for both clusters, which suggests that the clusters differ significantly in the mean or median of these features.

To augment the SFIT results, we can check the values of each of these features for both clusters. Recall that earlier we examined the mean/median calories and saturated fat for both clusters, and indeed the clusters differ significantly on these features.

Recap:

Cluster 0

  1. Average calories is approximately 332 (Low-calorie dessert)
  2. Average sugar content is approximately 126 (Low sugar dessert)
  3. Average carbohydrate content is approximately 15 (Low carb dessert)

Cluster 1

  1. Average calories is approximately 3,551 (High-calorie dessert)
  2. Average sugar content is approximately 1,333 (High sugar dessert)
  3. Average carbohydrate content is approximately 157 (High carb dessert)

Now that we have this information, we can then label our clusters.

  • Cluster 0 - Low-fat Dessert
  • Cluster 1 - High-fat Dessert

What is a recommender system?

A recommender system is an information filtering system that predicts and recommends items to users based on their preferences, interests, and past behavior. The goal of a recommender system is to provide personalized recommendations that are useful and relevant to the user.

Why is a recommender system important?

A recommender system is important for several reasons:

  1. Personalization - Recommendations are made to a user based on their preferences, interests, and past behaviors. This increases the likelihood that a user will continue using the platform or service.
  2. Discoverability - Recommender systems help users discover new items or content that they may not have found otherwise.
  3. Reduced Analysis Paralysis - Recommender systems help users steer away from analysis paralysis, the inability to decide due to overthinking a problem.

What type of recommender systems were tried out?

There are many types of recommender systems, and we could talk about them all day. For this study, however, we chose to run and test three:

  1. User-based Collaborative Filtering
  2. Item-based Collaborative Filtering
  3. Latent Factor-based Collaborative Filtering
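To make the first of these concrete, here is a minimal numpy sketch of user-based collaborative filtering on a hypothetical 4x4 rating matrix: cosine similarity over co-rated items, with the prediction taken as a similarity-weighted mean of the neighbors' ratings. The actual study used the scikit-surprise implementations rather than this hand-rolled version.

```python
import numpy as np

# Tiny user x item rating matrix (0 = unrated); hypothetical data
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(u, v):
    mask = (u > 0) & (v > 0)            # compare only co-rated items
    if not mask.any():
        return 0.0
    return u[mask] @ v[mask] / (np.linalg.norm(u[mask]) * np.linalg.norm(v[mask]))

def predict_user_based(R, user, item, k=2):
    """Predict a rating as the similarity-weighted mean of the k most
    similar users who actually rated the item."""
    sims = np.array([
        cosine(R[user], R[v]) if v != user and R[v, item] > 0 else 0.0
        for v in range(R.shape[0])
    ])
    top = np.argsort(sims)[::-1][:k]
    top = top[sims[top] > 0]
    if len(top) == 0:
        return R[R > 0].mean()          # fall back to the global mean
    return sims[top] @ R[top, item] / sims[top].sum()

print(round(predict_user_based(R, user=0, item=2), 2))
```

User 0's nearest neighbor (user 1) rated item 2 low, so the prediction is pulled toward the low end, which is exactly the "people like you disliked this" behavior user-based filtering encodes.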

What was the rationale behind the choice of recommender system?

Each of the three collaborative filtering methods has its own advantages and disadvantages, and it is hard to rank them qualitatively. We therefore let the numbers decide: the choice of recommender system was ultimately based on the evaluation metrics.

How did we evaluate the performance of the recommender system?

The performance of the recommender systems was evaluated based on the following metrics:

  1. MSE and RMSE - Lower is better. This measures the accuracy of the recommendations made by the system.
  2. Coverage - Higher is better. This measures the proportion of items in the catalog that are recommended by the system. A high coverage indicates that the system can recommend a large number of items.
  3. Novelty - Contextual. This measures the novelty of the recommendations made by the system. A high novelty score indicates that the system recommends items the user has not seen before; however, users may prefer a more balanced novelty.
  4. Personalization - Higher is better. This measures how personalized the recommendations are for each user. A high personalization metric indicates that the recommendations are tailored to the individual user's preferences and needs, rather than being generic or popular recommendations.
  5. Intra-list Similarity - Higher is better for this use case. This measures how similar the items within a single recommendation list are to each other. A low score indicates diverse recommendations covering a wide range of items, while a high score indicates that the recommended items resemble one another, which is desirable here.
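Two of these metrics, coverage and personalization, can be computed from the recommendation lists alone. The sketch below uses made-up dessert lists and follows the common definitions (personalization as one minus the mean pairwise cosine similarity of users' lists); it is an illustration in the spirit of what the evaluation library computes, not its exact code.

```python
import numpy as np

# Hypothetical top-3 recommendation lists for four users
recs = [
    ["brownies", "eclairs", "flan"],
    ["brownies", "eclairs", "tiramisu"],
    ["macarons", "flan", "baklava"],
    ["eclairs", "macarons", "pavlova"],
]
catalog = ["brownies", "eclairs", "flan", "tiramisu", "macarons",
           "baklava", "pavlova", "cannoli", "churros", "mochi"]

# Coverage: share of the catalog that is ever recommended to anyone
coverage = len({r for lst in recs for r in lst}) / len(catalog)

# Personalization: 1 - mean pairwise cosine similarity of the users'
# recommendation lists (encoded as binary item vectors)
M = np.array([[item in lst for item in catalog] for lst in recs], float)
Mn = M / np.linalg.norm(M, axis=1, keepdims=True)
S = Mn @ Mn.T
n = len(recs)
personalization = 1 - (S.sum() - n) / (n * (n - 1))

print(coverage)                    # 0.7 -> 7 of 10 catalog items recommended
print(round(personalization, 3))   # -> 0.667
```

A system that recommends the same popular list to everyone would score a personalization of 0 here, regardless of how accurate its rating predictions are, which is why these metrics are reported alongside RMSE.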

What libraries were used to create the recommender systems?

For this study, we used the scikit-surprise library, a popular Python library for building recommender systems with collaborative filtering techniques. The algorithms we decided to use were:

  1. User-based kNN
  2. Item-based kNN
  3. SVD (Latent Factor)
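The third algorithm, latent-factor SVD, can be sketched in plain numpy as Funk-style matrix factorization trained by stochastic gradient descent on a handful of made-up (user, item, rating) triples. surprise's `matrix_factorization.SVD` follows the same idea, with per-user and per-item bias terms and more careful tuning.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical (user, item, rating) triples standing in for the real data
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 2),
           (2, 1, 1), (2, 2, 5), (3, 0, 5), (3, 2, 1)]
n_users, n_items, k = 4, 3, 2

P = rng.normal(0, 0.1, (n_users, k))   # user latent factors
Q = rng.normal(0, 0.1, (n_items, k))   # item latent factors
mu = np.mean([r for _, _, r in ratings])
lr, reg = 0.02, 0.01

for _ in range(2000):                  # SGD passes over the observed ratings
    for u, i, r in ratings:
        err = r - (mu + P[u] @ Q[i])
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

train_mae = np.mean([abs(r - (mu + P[u] @ Q[i])) for u, i, r in ratings])
pred = mu + P[0] @ Q[2]                # predict user 0's unseen item 2
print(round(float(train_mae), 3), round(float(pred), 2))
```

The unseen-pair prediction comes entirely from the learned factors, which is how latent-factor models capture the user-item relationships that neighborhood methods miss.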

Alongside it, we also used the recmetrics library, a Python library of evaluation metrics and diagnostic tools for recommender systems. recmetrics accepts the predictions from scikit-surprise and returns metrics such as:

  1. MSE and RMSE
  2. Coverage
  3. Novelty
  4. Personalization
  5. Intra-list Similarity

What problems did we encounter with dimensionality reduction?

The following were the problems encountered during the implementation of dimensionality reduction:

  • Since the reduced data still contained a high number of dimensions (539 and 31), we could not rely on the visualizations produced by projecting onto the first two singular vectors. Normally, the first two singular vectors capture most of the information; in this case, however, they retained too little for us to rely on the 2D plot.

  • The internal validation scores did not agree on a top-ranked number of clusters k. Instead, we had to use judgment, weighing the information from each score to decide on the optimal number of clusters.

What problems did we encounter with clustering?

The following were the problems encountered during the implementation of clustering:

  • Some clustering methods took a long time to run, especially density-based clustering.
  • Some metrics, such as BIC and the gradient of BIC, are hard to interpret.
  • On top of the long run times, the slow grid search added fuel to the fire.

What problems did we encounter with the implementation of the recommender system?

  • User-based collaborative filtering can potentially funnel recommendations, causing them to be almost identical for all users. This can be remedied by setting a minimum support for commonly rated items.
  • The scikit-surprise library is not plug and play; it can take some time to understand its syntax.
  • The scikit-surprise library only includes metrics such as RMSE, MSE, MAE, and FCP (Fraction of Concordant Pairs). If you want other metrics, you have to implement them yourself.

Conclusion

  1. Among the collaborative filtering techniques used, the best-performing one was latent-factor-based collaborative filtering. The algorithm was able to capture the complex relationships between users and items and it was able to make more accurate predictions than the user-based and item-based approaches.

  2. User-based collaborative filtering can potentially funnel recommendations. This phenomenon causes the recommendations to be almost identical for all users.

  3. The SFIT method does not work well with datasets having numerous features.

  4. The internal validation scores should hold more weight than visual appeal in two-dimensional space when deciding on the optimal number of clusters. This is especially true for data of high dimensionality.

  5. Data transformation should be performed user-wise, and not item-wise.

  6. The latent-factor-based recommender system was able to address the problems raised by Chef Almond:

    • A need for more personalized recommendations - The LF-based recommender system had the highest personalization score.
    • Recipe recommendations that are not confined to a narrow set of recipes - The LF-based recommender system had the highest coverage.
    • The ability to balance expected and unexpected recommendations - The LF-based recommender system had a more balanced novelty score.
    • Recommendations that are similar - The LF-based recommender system had a high intra-list similarity score.
  7. The choice of the clustering method depends on several factors, such as the nature of the data, the desired clustering outcome, and the resources available.

  8. Choosing the best number of clusters should not be based on the clustering that has the best visual appeal.

Recommendations

  1. This study can be further improved through the application of other clustering techniques and algorithms which could potentially reveal more meaningful insights from our data. See below for some examples of other clustering techniques and algorithms:

    • Model-based Clustering - This type of clustering assumes that data is a mixture of probabilistic models.
    • Subspace Clustering - This type of clustering finds clusters in subspaces of the feature space rather than the entire feature space.
    • Spectral Clustering - This type of clustering is a graph-based clustering that represents data as a graph and then applies graph theory to partition the graph into clusters.
    • Fuzzy Clustering - This type of clustering allows for a degree of uncertainty or "fuzziness" in assigning data points into clusters.
    • Constrained Clustering - This type of clustering allows users to specify constraints or conditions that must be satisfied by the clustering solution.
    • Deep Clustering - This type of clustering uses deep learning techniques, such as neural networks, to cluster data in an unsupervised manner.
    • Ensemble Clustering - This type of clustering uses multiple clustering algorithms or multiple runs of the same algorithm to produce a final clustering solution.
    • Self-Organizing Maps (SOM) - This type of clustering uses a neural network approach to create a low-dimensional representation of the data.
    • Affinity Propagation - This type of clustering algorithm is based on the concept of "message passing" between data points to find the cluster representatives.
    • Mean-Shift Clustering - This type of clustering algorithm is a non-parametric, density-based method that works by shifting "mean" points toward areas of high data density.
  2. Perform content-based collaborative filtering to capture the contributions of the ingredients as the features of each recipe profile.

  3. Try out other algorithms from the scikit-Surprise Library to explore possible algorithms that better fit the dataset such as:

    • random_pred.NormalPredictor - Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.

    • baseline_only.BaselineOnly - Algorithm predicting the baseline estimate for a given user and item.

    • knns.KNNBasic - A basic collaborative filtering algorithm.

    • knns.KNNWithMeans - A basic collaborative filtering algorithm, taking into account the mean ratings of each user.

    • knns.KNNWithZScore - A basic collaborative filtering algorithm, taking into account the z-score normalization of each user.

    • knns.KNNBaseline - A basic collaborative filtering algorithm taking into account a baseline rating.

    • matrix_factorization.SVD - The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.

    • matrix_factorization.SVDpp - The SVD++ algorithm, an extension of SVD taking into account implicit ratings.

    • matrix_factorization.NMF - A collaborative filtering algorithm based on Non-negative Matrix Factorization.

    • slope_one.SlopeOne - A simple yet accurate collaborative filtering algorithm.

    • co_clustering.CoClustering - A collaborative filtering algorithm based on co-clustering.

  4. Explore a batch recommender system. This can be done by clustering both the users and the items, with their cluster labels serving as the rows and columns, respectively, and the cluster mean used as the value of the utility matrix. This implementation dramatically reduces run time because the recommendations are the same for every member of a cluster.

  5. Combine the frequent itemset mining with recommender systems. By doing so, we can use the association rules to generate recommendations for the users. This can potentially provide better user recommendations.

  6. Explore other food categories like main dish, French, or 30-minute dish. This can also be applied to other types of food or desserts.

  7. Recommender systems are considered domain-agnostic and may be applied to other fields such as media, retail, clothing, and transportation.

  8. Another way to make the study more interesting is to reveal clusters within clusters. Although more complex at the onset, it can potentially lead to more meaningful insights.
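The batch idea in recommendation 4 can be sketched directly: given user and item cluster labels, average the observed ratings within each (user cluster, item cluster) block to form a small cluster-level utility matrix. All data below is hypothetical.

```python
import numpy as np

# Ratings matrix (0 = unrated) and hypothetical cluster assignments
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)
user_cluster = np.array([0, 0, 1, 1])
item_cluster = np.array([0, 0, 1, 1])

# Cluster-level utility matrix: mean observed rating per
# (user cluster, item cluster) block
n_uc, n_ic = user_cluster.max() + 1, item_cluster.max() + 1
U = np.zeros((n_uc, n_ic))
for uc in range(n_uc):
    for ic in range(n_ic):
        block = R[np.ix_(user_cluster == uc, item_cluster == ic)]
        rated = block[block > 0]            # ignore missing ratings
        U[uc, ic] = rated.mean() if rated.size else np.nan

print(U)  # -> [[4.5, 1.0], [1.0, 4.5]]
```

Every user in a cluster then shares the same ranking over item clusters, trading personalization for a utility matrix that is orders of magnitude smaller than the full user-item one.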

References

[1] Hunter. (2020, March). New survey reveals the pandemic’s impact on Americans’ eating behaviors [Press release]. Retrieved from https://www.hunterpr.com/newsroom/new-survey-reveals-the-pandemics-impact-on-americans-eating-behaviors/

[2] International Food Information Council. (2020). 2020 Food & Health Survey: COVID-19 Pandemic’s Impact on Food and Health. Retrieved from https://foodinsight.org/2020-food-and-health-survey/

[3] Food.com. (n.d.). Retrieved March 11, 2023, from https://www.food.com/

[4] Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30-37. doi: 10.1109/MC.2009.263

[5] Leskovec, J., Rajaraman A., and Ullman J. (2011). Mining of Massive Datasets. Retrieved from http://infolab.stanford.edu/~ullman/mmds/book.pdf (p. 326).

[6] Lavorini, V. (2018, November 22). Gaussian mixture model clusterization: How to select the number of components (clusters). Towards Data Science. https://towardsdatascience.com/gaussian-mixture-model-clusterization-how-to-select-the-number-of-components-clusters-553bef45f6e4

[7] DiFrancesco, V. (2021, February 25). Gaussian Mixture Models for Clustering. A beginner’s guide for expanding your clustering knowledge beyond K-Means. https://towardsdatascience.com/gaussian-mixture-models-for-clustering-3f62d0da675

[8] Deutschman, Z. (2023, January 24). Recommender Systems: Machine Learning Metrics and Business Metrics. Retrieved from https://neptune.ai/blog/recommender-systems-metrics

[9] Longo, C. (2018, November 23). Evaluation Metrics for Recommender Systems. Towards Data Science. https://towardsdatascience.com/evaluation-metrics-for-recommender-systems-df56c6611093

[10] Zhou, T., Kuscsik, Z., Liu, J. G., Medo, M., Wakeling, J. R., & Zhang, Y. C. (2010). Solving the apparent diversity-accuracy dilemma of recommender systems. Proceedings of the National Academy of Sciences, 107(10), 4511-4515. https://arxiv.org/pdf/0808.2670.pdf

[11] Ge, M., Delgado-Battenfeld, C., & Jannach, D. (2010, September). Beyond accuracy: evaluating recommender systems by coverage and serendipity. In Proceedings of the fourth ACM conference on Recommender systems (pp. 257-260). ACM.

[12] Lendave, V. (2021, October 24). How to Measure the Success of a Recommendation System? Analytics India Magazine. https://analyticsindiamag.com/how-to-measure-the-success-of-a-recommendation-system/

[13] Longo, C. (2021). Recmetrics. A python library of evaluation metrics and diagnostic tools for recommender systems. GitHub. https://github.com/statisticianinstilettos/recmetrics

[14] Horel, E., Giesecke, K., Storchan, V., Chittar, N. (2020). Explainable Clustering and Application to Wealth Management Compliance. https://arxiv.org/pdf/1909.13381.pdf

[15] Horel, E., Giesecke, K. (2019). Computationally Efficient Feature Significance and Importance for Machine Learning Models. https://arxiv.org/pdf/1905.09849.pdf

[16] Agrawal, S. (2021). Recommendation System - Understanding The Basic Concepts. What Is Recommendation System? https://www.analyticsvidhya.com/blog/2021/07/recommendation-system-understanding-the-basic-concepts/

Appendix

Exhibit 1. Choosing the Clustering Method for Users

Representative-based Clustering

  • K-means Clustering
  • K-Medoids Clustering

Hierarchical Clustering

  • Single Linkage
  • Average Linkage
  • Complete Linkage
  • Ward's Linkage

Density-based Clustering

  • DBSCAN
  • OPTICS


Probabilistic Clustering

  • Gaussian Mixture Clustering

Exhibit 2. Choosing the Clustering Method for Items

Representative-based Clustering

  • K-means Clustering
  • K-medoids Clustering

Hierarchical Clustering

  • Single Linkage
  • Average Linkage
  • Complete Linkage
  • Ward's Linkage

Density-based Clustering

  • DBSCAN
  • OPTICS

Probabilistic Clustering

  • Gaussian Mixture